Python interfaces a la Golang

Posted by Javier Asensio-Cubero on Mar 21, 2019

For the last four years I have been jumping back and forth between Python and Go. The majority of the work I carry out as a Machine Learning Engineer at Octopus Energy is focused on Python and the typical data science stack. We are working on medium/big scale data processing projects using all the usual suspects: keras, scikit-learn, pandas, airflow etc. But at the moment, I’m not doing any Go.

There are two major things that I miss from Go when working on a Python project: static typing and interfaces. At Octopus Energy we use Python’s type hinting to avoid silly mistakes (the kind I make most often), with some success, but there are intrinsic limitations.

While dealing with data streams and especially with the io package in Python, it becomes a burden trying to keep a flexible interface while reusing elements of the standard library. Consider the following example:

import pandas as pd

def transform(input_file: str) -> pd.DataFrame:
    df = pd.read_csv(input_file, index_col="index")
    return _transform(df)

This implementation has some drawbacks. Passing a file-path string for pandas to load makes testing difficult, as you need to create an actual CSV file. It also restricts the data transform to local files, whereas you might want to load the stream from other sources, such as an HTTP connection or an S3 bucket.

You might be tempted to rewrite this function using io classes.

import pandas as pd
import io

def transform(input_io: io.TextIOBase) -> pd.DataFrame:
    df = pd.read_csv(input_io, index_col="index")
    return _transform(df)

This is more sensible: we’re using a class offered by the standard library, and pandas will like it because it relies on an implicit reading protocol (more on this soon). This means that as long as whatever you send down to pd.read_csv has a method called read, pandas will be able to load the data. But it’s difficult to extend the classes in the io package. Many FTP or S3 readers won’t inherit from these classes, as typing is not really big in the Python ecosystem. Also, building fake io subclasses for testing is an ordeal, as you need to create a bunch of empty stubs to extend them.
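As a minimal sketch of the testing point, an in-memory io.StringIO can stand in for a real file wherever only reading and iteration are needed. The helper name column_names is hypothetical, and the standard library csv module is used in place of pandas to keep the example dependency-free.

```python
import csv
import io
from typing import List


def column_names(stream: io.TextIOBase) -> List[str]:
    """Return the header row of a CSV stream (hypothetical helper)."""
    # csv.reader only iterates over the stream line by line.
    return next(csv.reader(stream))


# No file on disk is needed: io.StringIO is a subclass of io.TextIOBase.
fake_file = io.StringIO("index,value\n0,10\n1,20\n")
print(column_names(fake_file))
```

This works because StringIO already inherits from the io hierarchy; a third-party stream that offers the same methods without that inheritance would still fail the type check.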

Python protocols

Let’s focus on the concept of a protocol in Python. This is not something new: you might know that calling len() on any object that has a __len__ method will just work; __getitem__ and the pickling protocol are further examples. Python, being dynamically typed, looks up __len__ at runtime and raises a TypeError if it is missing.
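Here is a small sketch of these implicit protocols at work. The Playlist class is a hypothetical example; it inherits from nothing, yet the built-ins accept it because it defines the right special methods.

```python
class Playlist:
    """Hypothetical container that implements two built-in protocols."""

    def __init__(self, songs):
        self._songs = list(songs)

    def __len__(self) -> int:
        # Called by len(playlist).
        return len(self._songs)

    def __getitem__(self, index):
        # Called by playlist[index]; also lets Python iterate over the
        # object by indexing from 0 until IndexError is raised.
        return self._songs[index]


playlist = Playlist(["intro", "verse", "outro"])
print(len(playlist))
print(playlist[1])
print(list(playlist))

try:
    len(object())  # a plain object has no __len__
except TypeError as error:
    print("TypeError:", error)
```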

If you’ve written any Go code, this will sound familiar, but of course in Go you would turn your protocol into an actual interface. In Go, there’s no concept of classes or inheritance, just structs. Structs can have functions attached to them, called methods (sorry, the nomenclature makes the comparison a bit messy). You define interfaces as contracts that your struct needs to meet, just like protocols! But because Go is statically typed, this is enforced at compile time.

Relying on type hinting to play along with protocols is known as structural subtyping in Python.

You might be wondering how this is different from inheritance. Let’s have a look at the following example where we will create a useful function to print the length of different objects.

class LenMixin:
    def __len__(self) -> int:
        raise NotImplementedError()


class OneLengther(LenMixin):
    def __len__(self) -> int:
        return 1


def print_with_len(has_len: LenMixin):
    print(len(has_len))


print_with_len(OneLengther())  # yay
print_with_len(list([1, 2, 3]))  # nay! type error

Mypy will complain about the last line in this example, because list is not a subclass of LenMixin. That’s unfortunate, as the list class actually offers all the functionality we need. Of course, we could create a MyList class that inherits from both list and LenMixin, but that adds tons of boilerplate code.

The package typing_extensions allows us to use protocols in a statically typed fashion (since Python 3.8, Protocol is also available directly from typing). A Protocol defines a set of methods that a class needs to have in order to be used as an argument to another function.

from typing_extensions import Protocol


class Sizable(Protocol):
    """A custom definition of https://mypy.readthedocs.io/en/latest/protocols.html#sized"""

    def __len__(self) -> int:
        pass


class MySizable:  # No superclass
    def __len__(self) -> int:
        return 1


def print_sizable(sizable: Sizable):
    print(len(sizable))


print_sizable(MySizable())  # yay
print_sizable(list([1, 2, 3]))  # yaaaay!

In this way we just need to have the __len__ method in our class (or function in our module) in order to fulfill the print_sizable requirements.
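Protocol conformance is normally verified by mypy rather than at runtime. As a small sketch, assuming Python 3.8+ (where Protocol and runtime_checkable are available from the standard typing module), the same structural check can also be performed with isinstance. NotSizable is a hypothetical counterexample.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Sizable(Protocol):
    def __len__(self) -> int:
        ...


class MySizable:  # no superclass
    def __len__(self) -> int:
        return 1


class NotSizable:  # hypothetical class without __len__
    pass


# isinstance only checks that the required method exists,
# not that its signature or return type matches.
print(isinstance(MySizable(), Sizable))
print(isinstance([1, 2, 3], Sizable))
print(isinstance(NotSizable(), Sizable))
```

The first two checks pass and the last one fails, mirroring what mypy reports statically for print_sizable.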

The whole concept of dependency injection in Go revolves around the idea of having interfaces as a contract. Your Python object or your Go struct needs to fulfill this contract in order to be used as a parameter to a function. This blog post by Christoph Berger explains it way better, and this video by the great Francesc Campoy will give you deep insight into the concept.

Coming back to the reader example, here we define a protocol that allows reading streams.

from abc import abstractmethod
from typing_extensions import Protocol
from typing import Any

class Reader(Protocol):
    """Reader protocol sets the contract to read streams."""
    @abstractmethod
    def __iter__(self):
        """Make the reader an iterable."""
        pass
    
    @abstractmethod
    def read(self, size: int = -1) -> Any:
        """Read the given amount of bytes."""
        pass

The reason our reader needs to be iterable comes from pandas’ file objects.

Now we can rewrite our transform function to use this protocol as its input.

import pandas as pd
from abc import abstractmethod
from typing_extensions import Protocol
from typing import Any

class Reader(Protocol):
    """Reader protocol sets the contract to read streams."""
    @abstractmethod
    def __iter__(self):
        """Make the reader an iterable."""
        pass
    
    @abstractmethod
    def read(self, size: int = -1) -> Any:
        """Read the given amount of bytes."""
        pass


def transform(input_reader: Reader) -> pd.DataFrame:
    df = pd.read_csv(input_reader, index_col="index")
    return _transform(df)

From this implementation we obtain a function that accepts not only any file-like object, such as files, io.StringIO or io.BytesIO, but also objects from third-party libraries that stream from S3 buckets, FTP, SFTP or HTTP, as long as they are iterable and have a read(size: int = -1) method. No inheritance is needed, and no extra glue code is required to make this work with third-party libraries or testing fakes.
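To illustrate the testing-fake point, here is a minimal sketch of a fake that satisfies the Reader protocol without inheriting from anything. FakeReader and load_rows are hypothetical names; load_rows uses the standard library csv module as a stand-in for pd.read_csv so the example has no third-party dependencies, and Protocol is imported from typing, which assumes Python 3.8+.

```python
import csv
from typing import Any, Iterator, List, Protocol


class Reader(Protocol):
    def __iter__(self) -> Iterator[str]:
        ...

    def read(self, size: int = -1) -> Any:
        ...


class FakeReader:
    """Hypothetical test double: no inheritance, just the two methods."""

    def __init__(self, text: str) -> None:
        self._text = text

    def __iter__(self) -> Iterator[str]:
        # Yield the content line by line, keeping line endings.
        return iter(self._text.splitlines(keepends=True))

    def read(self, size: int = -1) -> str:
        # Simplified: always reads from the start, with no cursor state.
        if size < 0:
            return self._text
        return self._text[:size]


def load_rows(input_reader: Reader) -> List[List[str]]:
    # Stand-in for pd.read_csv: csv.reader only needs an iterable of lines.
    return list(csv.reader(input_reader))


rows = load_rows(FakeReader("index,value\n0,10\n"))
print(rows)
```

Mypy accepts FakeReader wherever a Reader is expected purely because the method names and signatures line up.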

Protocols and structural subtyping make injecting dependencies and creating fake objects for testing easy and elegant.