GFickel Blog

Apple Pies and License Plate Recognitions from Scratch

2024-12-11T15:00:00+00:00

The idea of creating something from scratch is both intimidating and exciting. It is tough to stare at a blank screen (usually a programming IDE), waiting for us to type the first characters of a big new project. But this is also a moment full of new possibilities, experiments, and learning. And as Carl Sagan once said, “if you wish to make an apple pie from scratch you must first invent the universe”. With that cosmic perspective in mind, let’s set our expectations straight on what we mean by “from scratch” and what we want to achieve:

Any Deep Learning framework allowed: pytorch, JAX, keras, etc.
Use the fewest libraries possible: this helps with local debugging and general code understanding (i.e., our code does not jump into a black box), and makes the project much more flexible, for example when upgrading our frameworks to newer versions.
Should run fast on CPU: the GPU world is great, but I want something that runs somewhat fast on CPU. I’ll say that 100ms on my low/midrange notebook is good enough (AMD Ryzen 7 5700U).
Simple solution: ideally I would want a single end-to-end network, i.e., pass an image and receive the list of plates with their text, but this might be too challenging…

So with that in mind, what is a License Plate Recognition (aka LPR)? It’s just a system that both detects and reads the license plates from an image/video. It is commonly used in private parking lots, traffic monitoring systems, and similar applications.

Solution Pipeline

A good place to start is to examine the current state-of-the-art approaches, though license plate recognition isn’t currently a hot research topic. Drawing from my past experience (this won’t be my first nor second LPR implementation), I believe that a conceptually simple and easy to implement solution would be to tackle this problem in 2 stages:

Plate Detection: given an image or video frame, find all the license plate positions. Usually these are rectangular bounding boxes, but the plate corners would be better.
Plate Recognition: for each detection, crop the plate image and run an OCR network.

This is not an end-to-end solution as I wanted, but it’s so much easier to compose and train that it seems like a good approach. This gives us two areas to research: detection and OCR.
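To make the data flow concrete, here is a minimal sketch of how the two stages compose. The names recognize_plates, fake_detect and fake_read_text are hypothetical, and the (x1, y1, x2, y2) box format is an assumption of mine, not necessarily what the final code uses:

```python
from typing import Callable, List, Tuple

import numpy as np

# (x1, y1, x2, y2) in pixel coordinates; this box format is an assumption.
Box = Tuple[int, int, int, int]


def recognize_plates(
    image: np.ndarray,
    detect: Callable[[np.ndarray], List[Box]],
    read_text: Callable[[np.ndarray], str],
) -> List[Tuple[Box, str]]:
    """Stage 1: detect plates. Stage 2: crop each one and run OCR on it."""
    height, width = image.shape[:2]
    results = []
    for x1, y1, x2, y2 in detect(image):
        # Clamp the box to the image so slicing never goes out of bounds.
        x1, y1 = max(0, x1), max(0, y1)
        x2, y2 = min(width, x2), min(height, y2)
        if x2 <= x1 or y2 <= y1:
            continue  # skip degenerate boxes
        crop = image[y1:y2, x1:x2]
        results.append(((x1, y1, x2, y2), read_text(crop)))
    return results


# Stand-in "models", only here to show the data flow.
def fake_detect(image):
    return [(10, 20, 110, 50), (-5, 0, 0, 10)]


def fake_read_text(crop):
    return f"{crop.shape[1]}x{crop.shape[0]}"


frame = np.zeros((100, 200, 3), dtype=np.uint8)
plates = recognize_plates(frame, fake_detect, fake_read_text)
```

The nice part of this split is that each stage can be trained, debugged, and swapped independently, as long as the box format between them stays the same.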

Choosing our Networks

For detection, I had great results with SCRFD. It is a network specially tailored for face detection, and the reason why a regular Object Detector was not good enough for faces was quite interesting: most faces are small compared to the whole image. Therefore, regular CNN approaches struggle with this because their deeper layers, which are responsible for generating complex features, lose spatial resolution due to successive downsampling operations like MaxPool.

How this is solved: with a powerful neck that combines the information of the earlier, higher-resolution layers with the later and smaller ones. This allows the network to get sophisticated features even for small objects on the image. This approach combined with a carefully crafted backbone made SCRFD a really small and fast face detection network.

But why am I talking so much about faces? Well, in many scenarios, I believe that license plates also have the same problem: they appear very small within the whole image. Therefore, I believe that this approach should also work, and we are going to stick to it.

And for OCR? I’ve read many papers on what they usually call Text Recognition or Scene Text Recognition. I’ve found that many state-of-the-art papers are combining some language model to add a prior on the pure OCR. This was previously done using a dictionary and beam search, where we would get a word like “NUMBR” and it would be changed to “NUMBER”. Using a Language Model is, however, a more robust solution.

It is important, though, to check our scenario: license plates are almost random, usually only containing some simple structure such as number of characters and fixed places for numbers and letters. Using a language model just seems overkill for such simple rules, and possibly will even hurt the performance if we are not careful during the training stage.
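To illustrate how simple those structural rules can be, here is a small sketch that validates an OCR output against a fixed layout. The layout (three letters followed by four digits) and the names PLATE_PATTERN and is_valid_plate are hypothetical, picked only for the example; real layouts depend on the country:

```python
import re

# Hypothetical layout: three uppercase letters followed by four digits.
PLATE_PATTERN = re.compile(r"[A-Z]{3}[0-9]{4}")


def is_valid_plate(text: str) -> bool:
    """Return True when the whole string matches the expected layout."""
    return PLATE_PATTERN.fullmatch(text) is not None


checks = {t: is_valid_plate(t) for t in ["ABC1234", "AB12345", "ABC123"]}
```

A check like this can run as a cheap post-processing step on top of the OCR, without any language model involved.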

After some more searching, I’ve found MaskOCR. It uses Vision Transformer (ViT) for encoding our words, which is, in itself, a much more intuitive approach than CNN-based methods for this particular task. The transformer can naturally subdivide our image into vertical patches, and their relationships will be given by the attention phase. I will not get into many details on how it works, but it first has an initial training process that uses masked autoencoders (MAE) to initialize the encoder part. Afterwards, we attach a decoder with a linear layer and do the final OCR predictions. It is a simple enough solution that we can implement, and it achieved really good results, so that’s our OCR network.
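As a rough illustration of the vertical patches idea, the sketch below cuts an image into full-height strips and flattens each one into a token. The function name to_vertical_patches and the patch width of 4 pixels are assumptions for the example; the 48x192 size is the one I ended up using for training:

```python
import numpy as np


def to_vertical_patches(image: np.ndarray, patch_width: int) -> np.ndarray:
    """Split an HxWxC image into W/patch_width flattened vertical strips."""
    height, width, channels = image.shape
    if width % patch_width != 0:
        raise ValueError("image width must be divisible by patch_width")
    num_patches = width // patch_width
    # (H, W, C) -> (H, N, P, C) -> (N, H, P, C) -> (N, H*P*C)
    patches = image.reshape(height, num_patches, patch_width, channels)
    patches = patches.transpose(1, 0, 2, 3)
    return patches.reshape(num_patches, height * patch_width * channels)


image = np.arange(48 * 192 * 3, dtype=np.float32).reshape(48, 192, 3)
tokens = to_vertical_patches(image, patch_width=4)  # one row per strip
```

Each row of tokens would then go through a linear projection to embed_dim before entering the transformer encoder, with the strips ordered left to right just like the characters on the plate.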

Implementing Them

Fortunately, SCRFD already has an open-source implementation available, which provided a great starting point. However, it uses the OpenMMLab libraries. They are awesome, and we can easily change some configs and get some really new and state-of-the-art networks. But with this great flexibility comes a serious drawback: the installation process is janky. We have to use openmim instead of pip or conda, making it harder to config our environment. Also, it is quite strict with CUDA and PyTorch versions, so we are kinda stuck with older releases.

This was a big no-go for this project, so I decided to directly get the code that I need and drop this requirement altogether. It took a bit of work, changing some interfaces and simplifying some details, but I’ve managed to do it. And in the process, I’ve learned a lot about how MMDetection (the OpenMMLab detection toolbox) works, which is a great thing.

Also, I decided to use the EfficientDet BiFPN (bi-directional feature pyramid network) for the neck. It proved itself as a very strong neck, and I think that being bi-directional is a really good strategy to make the best use of our limited backbone features. And I’m calling them limited only because I’ll use the smallest backbone that I can find, and that was MobileNetV4. In the end it is a little bit different from SCRFD, but the main gist of it remains, only updating some parts.

For MaskOCR it was a bit trickier: there was no implementation available. This is not that big of a deal, though, since I was able to get the more complicated stuff from ViT Pytorch, and only had to piece everything together and set up the training process. It took a bit of work but it paid off.

Both implementations can be found here: https://github.com/gfickel/alpr

Training Everything

Training an LPR system requires both quality data and careful parameter tuning. Let’s break down the process, starting with dataset selection and preparation.

The first step in the training process is actually finding and preparing our data. I’ve found a really interesting dataset called CCPD2019. It contains over 300K annotated images of Chinese license plates, and even has some subsets with different scenarios. Those are the ones that I’m using:

ccpd_base: good set of images, used for training
ccpd_weather: images captured in heavy weather, used for validation
ccpd_challenge: used for testing

The training process was somewhat straightforward: I’ve used AdamW, dlib plateau detection to check when the learning rate should be decreased, and for the detection model, I’ve set the backbone learning rate to 1/10 of the rest of the network. All of this and the final weights can be found on my GitHub repo: https://github.com/gfickel/alpr
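The lower backbone learning rate can be expressed with optimizer parameter groups. The sketch below is not taken from the repo: build_param_groups is a hypothetical helper, and the "backbone." name prefix is an assumption about how the parameters are named. It only builds the list of dicts in the format that PyTorch optimizers such as torch.optim.AdamW accept:

```python
from typing import Dict, Iterable, List, Tuple


def build_param_groups(
    named_params: Iterable[Tuple[str, object]],
    base_lr: float,
    backbone_prefix: str = "backbone.",  # assumed naming convention
    backbone_lr_scale: float = 0.1,      # backbone trains at 1/10 of base_lr
) -> List[Dict]:
    """Split parameters into backbone and non-backbone groups."""
    backbone, rest = [], []
    for name, param in named_params:
        (backbone if name.startswith(backbone_prefix) else rest).append(param)
    return [
        {"params": backbone, "lr": base_lr * backbone_lr_scale},
        {"params": rest, "lr": base_lr},
    ]


# Strings stand in for tensors here. With PyTorch you would pass
# model.named_parameters() and give the result to torch.optim.AdamW.
demo_params = [
    ("backbone.conv1.weight", "w0"),
    ("neck.fpn.weight", "w1"),
    ("head.cls.bias", "b2"),
]
groups = build_param_groups(demo_params, base_lr=1e-3)
```

When the plateau detection decides it is time to decrease the learning rate, scaling the "lr" of every group by the same factor keeps the 1/10 ratio intact.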

Hyperparameters Tested

For the Detection network, I only changed the start learning rate and used weight_decay=0.01 with the largest batch size that my GPU could handle. I did a quick check on some possible backbones such as ResNet and EfficientNet but mainly stuck with MobileNetV4 since it was providing the best bang for the buck.

Training MaskOCR was a little bit more complicated. Here are some key parameters:

image size: I started using 32x128, but when I changed to 48x192 I quickly noticed a bump in accuracy.
num encoder layers: I tried several combinations, but every time I used fewer than 8 the accuracy quickly dropped, and higher numbers stayed the same or increased overfitting. I ended up using 8.
num decoder layers: also tested several values, and 6 was the best one.
dropout: I added dropout both on encoder and decoder phases with a value of 0.25, all in the name of avoiding overfitting.
num encoder heads: either 8 or 12 were giving me good results but 12 was just a tad bit better.
embed_dim: great influence on the results. 624 was the sweet spot for me.

This network also had a tendency to overfit. I had to write my custom augmentation code and added a parameter to control its strength. Even with 300K images, heavy augmentations were fundamental in getting good results.
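To give an idea of what a strength parameter can look like, here is a simplified sketch. It is not my actual augmentation code: the function augment, the chosen transforms (contrast, brightness, Gaussian noise), and their ranges are all assumptions made for the example. The only point is that a single value in [0, 1] scales every perturbation:

```python
import numpy as np


def augment(image: np.ndarray, strength: float, rng: np.random.Generator) -> np.ndarray:
    """Random contrast, brightness, and noise, all scaled by strength in [0, 1]."""
    out = image.astype(np.float32)
    contrast = 1.0 + strength * rng.uniform(-0.5, 0.5)
    brightness = strength * rng.uniform(-40.0, 40.0)
    mean = out.mean()
    # Stretch around the mean (contrast), then shift (brightness).
    out = (out - mean) * contrast + mean + brightness
    # Additive Gaussian noise whose standard deviation grows with strength.
    out += rng.normal(0.0, 10.0 * strength, size=out.shape)
    return np.clip(out, 0, 255).astype(np.uint8)


rng = np.random.default_rng(0)
plate = np.full((48, 192, 3), 128, dtype=np.uint8)
unchanged = augment(plate, strength=0.0, rng=rng)  # strength 0 is the identity
augmented = augment(plate, strength=1.0, rng=rng)
```

Having one knob like this makes it easy to sweep the augmentation strength as just another hyperparameter.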

Results

We achieved 93% accuracy on ccpd_challenge, the hardest set and usually reserved for testing. Notice that there are some annotation problems, mostly invalid plates and humanly unreadable plates. We can argue that “unreadable” is somewhat subjective, and that the model should be able to outperform humans. However, this makes it quite challenging to determine if the mistake came from the network or the annotation. Here is a very well-behaved example:

And what about the runtime? I’ve run some tests on my personal notebook, with an AMD Ryzen 7 5700U (with a modest TDP of 15W), 12GB RAM, Ubuntu 23.04:

Detection: ~80ms
OCR (per plate): ~48ms

We’ve exceeded our initial budget of 100ms by 28ms, which is significant. We definitely can iterate further on both networks, testing the impact of some hyperparameters on the final runtime/accuracy and find some better ones. However, I’m running low on time, and I’m happy with where we are.
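For reference, here is a small sketch of how numbers like these can be measured and combined. The helper names median_latency_ms and total_latency_ms are hypothetical, and the total assumes the OCR runs sequentially, once per detected plate:

```python
import statistics
import time
from typing import Callable, List


def median_latency_ms(fn: Callable[[], object], warmup: int = 3, runs: int = 20) -> float:
    """Median wall-clock time of fn() in milliseconds, after a few warmup calls."""
    for _ in range(warmup):
        fn()
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def total_latency_ms(detection_ms: float, ocr_ms: float, num_plates: int) -> float:
    """One detection pass plus one OCR pass per plate (sequential OCR assumed)."""
    return detection_ms + ocr_ms * num_plates


one_plate = total_latency_ms(80.0, 48.0, num_plates=1)   # 128.0 ms
two_plates = total_latency_ms(80.0, 48.0, num_plates=2)  # 176.0 ms
measured = median_latency_ms(lambda: sum(range(1000)))   # toy workload
```

Note that the cost grows with the number of plates in the frame, so the 128ms figure is the single-plate case.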

Missing Steps for Deploy

There is a world of difference between ideal research conditions and actually deploying a Machine Learning model. It is important to define this at the very start of the project and update our priorities and goals accordingly. Here are some questions that we should always ask:

Is it going to work on pictures or video?
Maximum latency? 100ms, 1s, 10s?
Will it run on Cloud? If so, on CPU, GPU, TPU?
Will it run on smartphones? Android, iOS? Minimum SDK and phone specs?
What metrics should we use? FAR/FRR, AUC? And what is our goal, remembering that there is no perfect system?

These questions will give us a set of constraints that we must follow: maximum latency and where we should measure it (CPU, GPU, smartphone), model size (really important for smartphones), architecture design (perhaps we can use some Android/iOS AI building blocks), etc.

Some Tips

It is a very fun and challenging process to try and make something as big as an LPR, but there are many pitfalls down the bumpy road. Here are some key tips for a much faster and more productive process:

Good Logging: use a platform that makes it easy to compare multiple training sessions. I’m using Weights & Biases but you should use whatever you like.
FAST Iteration: quick iteration time doesn’t mean only making a code change and running/debugging, but also fast training runs. Ideally, fully training a model should take no longer than an hour. Usually you should use a smaller training dataset and some smarter way to train, such as fit_one_cycle and lr_find. This way you can quickly test several ideas before sticking to a few and doing a full, lengthy training run.
Good Debug Experience: either through notebooks or through an IDE, my preferred way. Programming is hard, and tracking all the tensor shapes and their modifications is usually quite tricky, so having an easy way to debug your code along the way can make your life so much easier.
LLMs Are Quite Good: I’m slightly embarrassed to admit that I’m a late LLM adopter, but I’m finding they are really helpful. However, they make a lot of mistakes, so you should never blindly trust them, but they are awesome in several areas such as writing boilerplate code, serving as an interactive documentation for many popular libs, and explaining some concepts with code and plots.

And if my first image left you wanting an apple pie, look no further than the help of cooking master J. Kenji López-Alt here.

Creating a Model Server and Making Better Wheels

2024-03-23T15:00:00+00:00

There are already some pretty good model servers with really good features, like Triton, TorchServe and TensorFlow Serving. So… why make another one when xkcd already warned us?

I took some liberties using this comic strip, but the main point remains: why try to reinvent the wheel? This is an old and trusty saying, and there is so much new stuff that we could be creating instead of redoing something that has been done by several people, often with more experience in this particular area than you. But I don’t fully buy into that. It is a good rule of thumb for the, probably, vast majority of time, but not always. As John Carmack said in his Commencement Speech at UMKC: “It’s almost perceived wisdom that you shouldn’t reinvent the wheel, but I urge you to occasionally try anyway. You’ll be better for the effort, and this is how we eventually end up with better wheels.” Getting better wheels is hard and not always guaranteed, but getting better for the effort is always the case.

So getting back to our Model Server project, I wanted something that was simple to use and could add any model that I wanted, either PyTorch, TensorFlow, or ONNX, using both CPU and GPU. Also, using a big Open Source project has a hidden cost: fixing and debugging its code. Don’t get me wrong, Open Source is awesome, but immersing yourself in lots of new code, with several layers of poorly documented (and often undocumented) abstractions, is no easy feat. And like the following wisdom of xkcd warned us, we really should be careful when depending on a large stack of dependencies that we can barely grasp.

I will be starting with Python, since it is the language most used by ML folks, and should make our life easier when importing some more obscure and heavily code-dependent models. And to build our server, gRPC seems like a great call: it is supported in a bunch of languages and defines the server interfaces through protobufs, which I quite like since it makes it way harder to commit some silly errors passing and getting data from it. Let’s build it in parts, starting as simple as possible and adding new features after. If you want to look at the final code, check it out here: https://github.com/gfickel/tiny_model_server

Barebones Server

With those previous definitions in mind, we can almost start writing the skeleton of a server, we just need to figure out how to define our interface and write the appropriate protobuf. Since I mostly deal with images, I’ll start implementing a route to receive an image and return a dict with the results. Let’s start with the protobuf:

syntax = "proto3";

service Server {
  rpc RunImage(ImageArgs) returns (Response) {}
}

message ImageArgs {
    NumpyImage image = 1;
    string model = 2;
}

message Response {
    string data = 1;
}

There is a lot to unpack here. You can check the Protobuf Docs for more details, but the main point here is the declaration of a service Server that has an RPC called RunImage. This RPC takes an ImageArgs and returns a Response. Looking at a high level all seems to make sense, so let’s look a little bit closer.

ImageArgs and Response are both messages that define how data is passed to and returned from our server. Response has only a single field called data of type string. So we are getting a string back from our server after we call RunImage. It is not the dictionary we wanted, but we can easily encode and decode to string using the json lib. Regarding ImageArgs, things get a little bit more complicated: we have a NumpyImage image that holds the binary data and a string that defines what model we want. The trickiest part is NumpyImage, and that’s how I defined it:

message NumpyImage {
    int32 height = 1;
    int32 width = 2;
    int32 channels = 3;
    bytes data = 4;
    string dtype = 5;
}

We have the height, width, and number of channels as integer types, the numpy dtype stored as a string, and the binary data on data. With all of this, we can almost send and receive numpy images (matrices) at will, we just need 2 things: learn how to access those datatypes in our Python and write some code to help us encode and decode to this format. To solve the first problem we must “compile” our protobuf file that will generate some Python code that we’ll use. Here’s the command:

python -m grpc_tools.protoc -I. --python_out=./ --pyi_out=./ --grpc_python_out=./ simple_server.proto

This command will read our protobuf file and generate two new Python files: simple_server_pb2.py and simple_server_pb2_grpc.py (plus a simple_server_pb2.pyi type stub, because of the --pyi_out flag). I’ll mention them when we use them, but the main point is that they provide interfaces to our protobuf definitions.

And now, on the code to encode and decode our numpy images to the Protobuf messages:

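import numpy as np

import simple_server_pb2  # generated by grpc_tools.protoc

# Supported dtypes; extend this mapping to support other ones.
np_dtype_to_str = {
    np.dtype(np.uint8)   : 'uint8',
    np.dtype(np.float32) : 'float32',
    np.dtype(np.float64) : 'float64',
}
str_to_np_dtype = {v: k for k, v in np_dtype_to_str.items()}

def numpy_to_proto(mat):
    """Encodes a HxW or HxWxC numpy array as a NumpyImage message."""
    dtype_str = np_dtype_to_str[mat.dtype]

    return simple_server_pb2.NumpyImage(
            height=mat.shape[0],
            width=mat.shape[1],
            channels=(1 if len(mat.shape)==2 else mat.shape[2]),
            data=mat.tobytes(),
            dtype=dtype_str
        )

def proto_to_numpy(image):
    """Decodes a NumpyImage message back into a numpy array."""
    dtype = str_to_np_dtype[image.dtype]

    # np.frombuffer returns a read-only view over the message bytes;
    # call .copy() on the result if you need to modify the image.
    np_image = np.frombuffer(image.data, dtype=dtype)
    if image.channels == 1:
        # Note: a HxWx1 input comes back as HxW.
        shape = (image.height, image.width)
    else:
        shape = (image.height, image.width, image.channels)

    return np_image.reshape(shape)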

The code is quite straightforward, with two different functions: one to encode a numpy image to a protobuf message, and another to do the opposite. I’ve hardcoded the supported dtypes on np_dtype_to_str, but it is trivial to expand to other ones. You may notice that we are using simple_server_pb2 here, and that’s one of the automatically generated Python files that I’ve mentioned. Ok, finally we have defined our interface and created our protobuf accordingly, we are just missing the most important part: the server! And here we have it:

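import json
from concurrent import futures

import grpc

import simple_server_pb2
import simple_server_pb2_grpc

# proto_to_numpy is the helper defined in the previous listing.

# For a proto service named Server, the generated base class is
# ServerServicer and the registration function is add_ServerServicer_to_server.
class SimpleServer(simple_server_pb2_grpc.ServerServicer):

    def __init__(self):
        self.models = {}

    def RunImage(self, request, context):
        model_name = request.model
        image = proto_to_numpy(request.image)
        # results = self.models[model_name].run(image)
        results = {'score': 42.0}

        return simple_server_pb2.Response(
                data=json.dumps(results))

def serve():
    server = grpc.server(
        futures.ThreadPoolExecutor(max_workers=8))
    route_servicer = SimpleServer()
    simple_server_pb2_grpc.add_ServerServicer_to_server(
        route_servicer, server)
    server.add_insecure_port('[::]:50051')
    server.start()
    server.wait_for_termination()

if __name__ == '__main__':
    serve()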

Ok, now we finally have a server running! But first, let’s look at this code and see how it is done. First, we defined a class called SimpleServer that inherits from the servicer base class in simple_server_pb2_grpc, the other one of those automatically generated files from protobuf. For a proto service named Server, that generated class is called ServerServicer. It provides all the nitty gritty stuff to create a gRPC service, and we just need to define our RPC routes as methods. In our case, that is RunImage, which gets an ImageArgs message, decodes our image back to numpy with proto_to_numpy, gets the desired model from request.model, calls it, and returns a Response message. You may notice that we are faking running a model and returning a fixed response. This is the subject of our next Section.

With this SimpleServer in hand, we just need to set up a gRPC server and run it. There is not much going on there, we are basically creating a server with max_workers threads, adding our SimpleServer service to this server, defining a port to run it, and starting it. You can check out this official tutorial to get some more insights, but we’ll get back to those in future sections.

Adding Models

Ok, we have a model server that is doing “everything”, except run models. Let’s tackle that. Recalling one of our goals: it must be easy to add new models, even if they contain lots of Python code. I believe that one of the easiest things would be to create a defined interface that each model must comply with, and our model server loads all of them. For instance, we can have this base interface as the following:

import abc

class ModelInterface(abc.ABC):

    def get_input_shape(self):
        """ Returns numpy shape """
        return None

    @abc.abstractmethod
    def run(self, data, args):
        """ Returns a response dict """

    def run_batch(self, data, args):
        """ Same interface as run, however, the images batch is encoded on
            a single numpy image. If the model does not provide a batch option
            just call it once for every input data.
        """
        return [self.run(x, args) for x in data]

And our model code would be something like this:

class Model(ModelInterface):

    def __init__(self):
        """ Here you may load an instance of your model """
        self.model = 'load my model here'

    def get_input_shape(self):
        """ Returns just like numpy shape """
        return (1080, 1920, 3)

    def run(self, data, args):
        return [('object1',0.3),('object2',0.5)]

The idea is to inherit from ModelInterface, load our model on __init__, and define, at least, the method run. Since all of this is just plain Python, we can do everything we want within run, which should make it quite simple to add new models here. For example, I’ve already used MTCNN (https://github.com/davidsandberg/facenet/tree/master/src/align), which has quite a lot of Python code to deal with 3 different Neural Networks used in a cascade fashion, and it was straightforward to add it here.

Now the only problem left is to make our server find those models. I’m using a simple solution, consisting of creating a new folder within models/ with the name of your model, and inside it, you will have an __init__.py with this class Model that implements the run method, and you can put whatever extra necessary code in there. Inside our server we can check all the available models like this:

all_models = os.listdir('models/')

The last piece of the puzzle is to actually import and instantiate those models to a usable Python object. You can do this with importlib (https://docs.python.org/3/library/importlib.html), which enables us to import a module whose path is decided at runtime. In the end, we can have something like this on our server:

for model in os.listdir('models/'):
    # Each model lives in models/<name>/__init__.py
    model_path = f'models.{model}'
    module = importlib.import_module(model_path)
    importlib.reload(module)
    self.models[model] = module.Model()

With this code, we are instantiating all of our models and putting them into a dict, with its name as key. So, we can update our server code to be like this:

import importlib
import json
import os
from concurrent import futures

import grpc

import simple_server_pb2
import simple_server_pb2_grpc

# proto_to_numpy is the helper defined earlier.

class SimpleServer(simple_server_pb2_grpc.ServerServicer):

    def __init__(self):
        self.models = {}
        for model in os.listdir('models/'):
            # Each model lives in models/<name>/__init__.py
            model_path = f'models.{model}'
            module = importlib.import_module(model_path)
            importlib.reload(module)
            self.models[model] = module.Model()

    def RunImage(self, request, context):
        model_name = request.model
        image = proto_to_numpy(request.image)
        # ImageArgs carries no extra arguments yet, so args is None.
        results = self.models[model_name].run(image, None)

        return simple_server_pb2.Response(
                data=json.dumps(results))

def serve():
    server = grpc.server(
        futures.ThreadPoolExecutor(max_workers=8))
    route_servicer = SimpleServer()
    simple_server_pb2_grpc.add_ServerServicer_to_server(
        route_servicer, server)
    server.add_insecure_port('[::]:50051')
    server.start()
    server.wait_for_termination()

if __name__ == '__main__':
    serve()

Finally, we have a working model server! But wait, how do I call it? I can add as many models as I want, but how do I actually use this in my code? That’s a question for the next Section.
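If you want to play with the folder-based discovery from this section without a gRPC server around it, here is a self-contained sketch. The helper load_models and the package name demo_models are hypothetical (the server itself uses models/), and the throwaway model is written to a temporary directory just for the demonstration:

```python
import importlib
import os
import sys
import tempfile


def load_models(models_dir: str) -> dict:
    """Import every sub-package of models_dir and instantiate its Model class."""
    parent, package = os.path.split(os.path.abspath(models_dir))
    if parent not in sys.path:
        sys.path.insert(0, parent)  # make the package importable
    importlib.invalidate_caches()   # the folders may have just been created
    models = {}
    for name in sorted(os.listdir(models_dir)):
        init_file = os.path.join(models_dir, name, "__init__.py")
        if not os.path.isfile(init_file):
            continue  # skip stray files and folders such as __pycache__
        module = importlib.import_module(f"{package}.{name}")
        models[name] = module.Model()
    return models


# Source of a throwaway model, following the same convention:
# <models_dir>/<model_name>/__init__.py containing a class called Model.
MODEL_SOURCE = (
    "class Model:\n"
    "    def run(self, data, args=None):\n"
    "        return {'num_values': len(data)}\n"
)

root = tempfile.mkdtemp()
model_dir = os.path.join(root, "demo_models", "example_image")
os.makedirs(model_dir)
with open(os.path.join(model_dir, "__init__.py"), "w") as f:
    f.write(MODEL_SOURCE)

loaded = load_models(os.path.join(root, "demo_models"))
result = loaded["example_image"].run([1, 2, 3])
```

The check for __init__.py is a small extra over the listing above: it keeps the loop from trying to import things that are not model folders.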

Calling Model Server

We have a fully functional model server, but all will be in vain if it is a pain to use. Fortunately, we can make things easier by creating a Model Client, that your code can use. Ideally, we want to establish a client for each model within a single line, and another one to run the model. It really should be that simple, and the complexity should be invisible to the user. A good practice when defining interfaces is to write the final code how you think it should behave, with all (and only) information necessary. This is our end goal:

model = ModelClient(model='example_image', ip='localhost', port=50000)
res = model.run_image(image)

I’ve mentioned hiding the complexity but really there is not much to it. Mostly is just making sure that we managed to connect to our server and some boilerplate code to convert data back and forward. Let’s look at what it looks like:

import abc
import json
import time

import grpc
import numpy as np

import server_pb2
import server_pb2_grpc
import utils

# ConnectionTimeout is a custom exception of the project (not shown here).

class ModelClient(abc.ABC):
    def __init__(self, model: str, ip: str, port: str='50000', timeout: int=60*5):
        self.model = model
        self.channel = None
        self.stub = None
        self.size = None

        self._connect(ip, port, timeout)

    def _connect(self, ip: str, port: str, timeout: int):
        # Keep the channel and stub around so every call reuses the connection.
        self.channel = grpc.insecure_channel(f'{ip}:{port}')
        self.stub = server_pb2_grpc.ServerStub(self.channel)

        begin = time.time()
        while self.size is None: # keep trying to connect until timeout
            try:
                response = self.stub.GetInputSize(
                    server_pb2.StringArg(data=self.model))
                self.size = json.loads(response.data)
            except grpc.RpcError:
                # Server not up yet (or call failed); wait and retry.
                time.sleep(1)
            if time.time()-begin > timeout and self.size is None:
                raise ConnectionTimeout(ip, port, timeout)

    def _get_image_arg(self, image: np.ndarray):
        image_proto = utils.numpy_to_proto(image)
        return server_pb2.ImageArgs(
                image=image_proto,
                model=self.model)

    def run_image(self, image: np.ndarray):
        """Runs an image into the given model."""
        if image is None or min(image.shape[0:2]) <= 2:
            return {'error': 'Bad image'}
        run_arg = self._get_image_arg(image)
        response = self.stub.RunImage(run_arg)
        return json.loads(response.data)

That’s a lot of code, so let’s start at the beginning. Our ModelClient takes as a parameter the model name (defined by its folder name), the ip and port of the server, and a connection timeout. On __init__ we just call _connect which creates a channel and a stub to the server. The idea here is to have a single channel and stub per model that we always keep open, so on every new model call we don’t have to deal with all the handshaking stuff.

Notice that on _connect we keep trying to call the GetInputSize RPC in order to see if our model server is on and responding. It is quite common to launch the model server at the same time as the application, and the model server may take longer to be up and running, so it is good to have a timeout to keep trying for a little bit. After we get our model input size we are done and ready.
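The retry-until-timeout pattern is independent of gRPC, so here is a minimal standalone sketch of it. The helper call_until_ready and the flaky_probe stand-in are hypothetical; in the real client the probe would be the stub call and retry_on would be (grpc.RpcError,):

```python
import time
from typing import Callable, Tuple, Type


def call_until_ready(
    probe: Callable[[], object],
    timeout: float,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    delay: float = 1.0,
) -> object:
    """Call probe() until it succeeds, or raise TimeoutError after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return probe()
        except retry_on:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"server not ready after {timeout} s")
            time.sleep(delay)


# Stand-in for a server that only answers on the third attempt.
attempts = {"count": 0}


def flaky_probe():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("server still starting")
    return [1080, 1920, 3]


size = call_until_ready(flaky_probe, timeout=5.0, delay=0.0)
```

Using time.monotonic for the deadline avoids surprises if the system clock is adjusted while we are waiting.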

To use our client we are going to call the run_image method, which takes an image and returns a dict. We are using a helper method called _get_image_arg to format our ImageArgs protobuf message, and calling our server through our stub. Finally, we are getting the results from .data, which is a string, and converting it back to a dict with json.loads.

And that’s it, quite easy for our end user. Notice that despite ModelClient hiding most of the complexities, it is still quite in reach for any user to debug its code and make changes as they see fit. Talking about changes… what about performance?

Multiprocessing Server

Yeah, performance is key, and a simple and easy to use model server is quite limited if we can’t scale vertically in this day and age of multiple GPUs and many-core CPUs. This is super simple on other servers, like gunicorn, but things are more barebones with gRPC. We have the max_workers argument when creating a server, but those workers are threads, and in CPython the Global Interpreter Lock (GIL) keeps them from executing Python code in parallel. They are great when there are many stalls due to IO, for example, but they don’t help us use our several CPU cores for max performance.

Reading gRPC’s own multiprocessing example, we have to do some tricks:

Fork our server code at the right time to create multiple processes
Create a connection with the option so_reuseport. This makes it possible for all of our forks to share the same port, and the Unix kernel will be responsible for doing the load balancing
This kernel load balancing doesn’t work if we want to keep our connection open to the server, since it will always be calling the same exact worker. We have to do load balancing manually

First, let’s create those parallel worker processes. We can do this by changing our server code a little bit:

import contextlib
import socket
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool

import grpc

import server_pb2_grpc

# ServerServicer is our servicer class from the previous sections.
# PORT_NUMBER and NUM_PARALLEL_WORKERS are module-level constants.

def _run_server(bind_address):
    """Starts a server in a subprocess."""
    options = (('grpc.so_reuseport', 1),)
    server = grpc.server(
        ThreadPoolExecutor(max_workers=8),
        options=options)
    server_pb2_grpc.add_ServerServicer_to_server(ServerServicer(), server)
    server.add_insecure_port(bind_address)
    server.start()
    server.wait_for_termination()

@contextlib.contextmanager
def _reserve_port(port_number):
    """Reserve a port for all subprocesses to use."""
    sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(('', port_number))
    try:
        yield sock.getsockname()[1]
    finally:
        # Release the reservation when the workers are done.
        sock.close()

def main():
    with _reserve_port(PORT_NUMBER) as port:
        bind_address = f'[::]:{port}'
        with Pool(processes=NUM_PARALLEL_WORKERS) as pool:
            # Blocks until every worker terminates.
            pool.starmap(_run_server, [(bind_address,) for _ in range(NUM_PARALLEL_WORKERS)])

if __name__ == '__main__':
    main()

Quite a little bit more code, so let’s dig in. First, we are calling _reserve_port with our port number. This function uses the socket library to bind to our desired port and set the SO_REUSEPORT flag so that we can fork our server and share the same port. Then we are using multiprocessing.Pool with our _run_server function that actually runs the server. This code is very similar to the old one, but now we are passing grpc.so_reuseport option to our grpc.server. That’s it, we now have a gRPC server that is running on NUM_PARALLEL_WORKERS workers in a truly parallel fashion.

The final piece of the puzzle here is the load balancer part. As previously mentioned, with this multiprocessing approach, it is up to the Unix kernel to distribute incoming connections to all available workers. However, this is a non-starter for our use case: the kernel only balances new connections, and it is way too expensive to open and close a new connection for every model call. How can we solve this?

Well, the simplest but still pretty good solution that I’ve found is to implement a route on the server that returns the number of parallel workers that it has and the current worker PID (process ID). On the client side, I’ll keep opening connections until I’ve established at least one to each worker, so the client can freely choose where to send each request. This means that all the load balancing is going to be on the client side… Couldn’t we do this on the server side for maximum performance?
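Here is a sketch of that discovery loop, with the gRPC part replaced by a stand-in so it runs on its own. Everything named here (connect_to_all_workers, fake_open, the tuple returned by open_connection) is hypothetical; in the real client, open_connection would create a new channel and call the route that reports the worker PID and the number of workers:

```python
import itertools
from typing import Callable, Dict, Tuple


def connect_to_all_workers(
    open_connection: Callable[[], Tuple[object, int, int]],
    max_attempts: int = 100,
) -> Dict[int, object]:
    """Open connections until every worker PID has been seen at least once.

    open_connection must return (connection, worker_pid, num_workers).
    """
    by_pid: Dict[int, object] = {}
    for _ in range(max_attempts):
        connection, pid, num_workers = open_connection()
        # Keep the first connection per worker; a real client would
        # close the duplicates instead of just dropping them.
        by_pid.setdefault(pid, connection)
        if len(by_pid) >= num_workers:
            return by_pid
    raise RuntimeError(
        f"reached only {len(by_pid)} workers after {max_attempts} attempts")


# Stand-in for the kernel handing each new connection to some worker.
fake_pids = itertools.cycle([101, 101, 102, 101, 103])


def fake_open():
    pid = next(fake_pids)
    return (f"conn-to-{pid}", pid, 3)


connections = connect_to_all_workers(fake_open)
```

The max_attempts cap is there because the kernel gives no guarantee about which worker gets the next connection, so the loop needs some way to give up.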

We could, but it requires a third piece on our puzzle, that will receive all the client’s requests and call the appropriate worker. The good thing is that this middleware sees all the clients and how each server worker is operating, so it has all the information to make the best decisions. However, this solution has two major drawbacks: it adds the cost of an extra data transfer, since we’ll have client->middleware->server instead of client->server, and it adds another layer of complexity. Those reasons are enough for me to choose client-side load balancing, and for my use, it is good enough.

There are many options for client-side load balancing, but let’s start with the simplest: Round Robin. Basically, for a set of N workers, we first call Worker 1, then Worker 2, and so on, wrapping around after Worker N, so that the load is spread across all workers over time. That is how I implemented it; it took only one line of code and it is working great! But this is an area where we could definitely improve: choose the next worker randomly so that we are less likely to have multiple clients in sync, stressing the same workers in the same order, or perhaps attach some worker usage information to each RPC response so we could make a smarter choice. But for now, it is good enough.
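To make the two strategies concrete, here is a small sketch with placeholder worker names (the real client would hold one connection per worker instead of strings). It is only an illustration of the idea, not the code from the repository.

```python
import itertools
import random

workers = ["worker-1", "worker-2", "worker-3"]

# Round robin: cycle through the workers in a fixed order, forever.
round_robin = itertools.cycle(workers)
round_robin_order = [next(round_robin) for _ in range(5)]

# Random choice: avoids several clients marching through the workers in lockstep.
rng = random.Random()
random_order = [rng.choice(workers) for _ in range(5)]
```

Round robin guarantees an even spread for a single client, while random choice trades that guarantee for independence between clients.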

Final Version and Next Steps

Our final code is a little bit more feature complete: it has unit tests, builds a Docker image that makes it easy to use with Kubernetes for scaling it horizontally, and more interface options and error checks. You can check here.

But there are many things missing, including but not limited to:

Route to process an image and return an image. Useful for image segmentation, optical flow (returning a HxWx2 np.float32 image, most likely), and other applications. I already added ImageResponse as a message on server.proto, I just need to implement a new route.
Better client-side load balancing as we mentioned.
Some Kubernetes configs for easy horizontal scaling.
Add some configurations to environment variables, such as port number and number of parallel workers. They can be easily set when running the Docker images.
Add Locust load tests.
Add support to ssl_server_credentials.

The good thing about being so small is that those things are somewhat simple to implement. And by simple I mean that there are not a lot of moving pieces here to keep track of, and they could be accomplished with a few lines of code.

Conclusion

That was a journey, but we managed to have a fully working Model Server with only 483 total lines of Python code! And that is including comments and empty lines (although I’m excluding the unit tests and example models). And if we look at our requirements.txt we have only gRPC related packages, numpy and Pillow to deal with images, and pytest for our testing purposes. That seems like a reasonable list.

In the end, I hope the main takeaway here is not a tutorial on “How to Create an Awesome Model Server with only 400 lines of code!!!!”, but an inspiration to explore new avenues, learn more about surrounding topics, and become a better programmer in the process. This experience definitely changed the way I see and judge other model servers for my projects, both for “good” and “bad”. The “bad” is that I know how simple things can be, and it sometimes drives me nuts to deal with dependency conflicts and tons of documentation just to add my model and start testing. On the other hand, there are also the “good” parts: I appreciate even more all the features that may sound trivial but make our lives so much easier and can be a pain to implement.

Making better wheels is definitely hard and we may not get it, but improving myself in the process is definitely a nice byproduct. And sometimes we don’t need the best high-tech wheel, just a simple one that is just perfect for our needs.

Making your GPU go BRRR: Creating a CUDA Layer in PyTorch

2024-03-13T15:00:00+00:00

I still remember the “dark ages” of research, back when I was doing my master’s, when it was common to find really impactful publications that provided no code. And yes, I’ve sent my fair share of emails to authors… Fortunately, this is no longer the norm, and is even somewhat frowned upon. Caffe, TensorFlow, Keras, PyTorch, and other deep learning frameworks really helped everyone write much smaller, cleaner code that is also easier to share.

Those frameworks are really incredible and allow us to quickly implement and test new ideas, however, they are not always the fastest way, even if they use CUDA down the line. This is definitely becoming a bottleneck. PyTorch 2 implemented a compile process to fuse layers to improve GPU usage, Flash Attention did the same by directly programming Attention in CUDA and achieved an even greater runtime improvement. Some more unorthodox solutions, such as Neighborhood Attention, also greatly benefited from manual CUDA programming.

CUDA programming may seem intimidating, at least it was for me. I first learned circa 2010 and it was a really bad development experience, but by watching an awesome video by Jeremy Howard, I’ve learned that it is indeed possible to have a much better experience. The main idea is the following:

Implement the forward and backward pass in PyTorch. This gives access to an online debugger and the full functionality of Python, like Jupyter Notebooks.
Validate the implementation with gradcheck. This somewhat magic function runs your forward pass and does numerical derivation to validate your backward pass code.
Program the CUDA Kernel for forward and backward passes using Numba, directly in Python. This is the real thing, where we are dealing with CUDA threads and possibly memory management.
Ask ChatGPT to convert this code to C CUDA. Really, it works surprisingly well!
Use PyTorch internal functionality to compile this C CUDA to a Python module that you can use with torch tensors.
Use gradcheck again to verify that your CUDA written layer is 100% correct.

It may be a couple of hoops to jump through, but the ability to develop CUDA code in Python makes our lives so much easier. You get easier integration with debuggers, and the iteration time between changing the code and running it is nearly instant, compared to the long time it takes to compile C CUDA. You may have noticed that I mentioned both forward and backward passes: unfortunately, if we use CUDA for our layer, we can’t rely on autograd to derive the backward pass for us. But fortunately, we have this amazing function from PyTorch, gradcheck, that will validate for us if our backpropagation is indeed correct.

We need some kind of end goal, and for us, it will be the implementation of the Sigmoid activation inspired by David Oniani. You’ll see that it has some interesting characteristics that will help us explore important aspects of creating a performant CUDA layer. And finally, all of this code can be found here.

1. Forward and Backward passes in PyTorch

The idea here is to do two functions: one for the forward pass and the backward one. But first, let’s remember the formula for the sigmoid and its derivative:

\[\sigma(x) = \frac{1}{1+e^{-x}}\] \[\sigma^{'}(x) = \sigma(x)(1-\sigma(x))\]

Those are not that complicated to implement, especially the derivative that only depends on the value of the sigmoid that we already computed on the forward pass. However, this sigmoid equation does present some numerical instabilities, so it is better to implement the following:

\[\sigma(x)=\begin{cases} \frac{1}{1+e^{-x}} & \text{ if } x>=0 \\ \frac{e^{x}}{1+e^{x}} & \text{ if } x<0 \end{cases}\]

With this in mind, we can generate the following Python code:

def sigmoid_forward_torch(input):
    out_tensor = torch.empty_like(input)
    positive_mask = input >= 0
    out_tensor[positive_mask] = 1. / (1. + torch.exp(-input[positive_mask]))
    out_tensor[~positive_mask] = torch.exp(input[~positive_mask]) / (1. + torch.exp(input[~positive_mask]))
    
    return out_tensor

def sigmoid_backward_torch(input):
    return input * (1 - input)

Notice that I’ve used a variable called positive_mask to create an index to identify positive and negative input values. Other than that, the code is somewhat straightforward.
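To see why the piecewise form matters, here is a scalar sketch in plain Python, where math.exp raises on overflow instead of returning inf as torch.exp does. The function names are made up for this illustration.

```python
import math


def naive_sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def stable_sigmoid(x: float) -> float:
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0, so z is in [0, 1) and cannot overflow
    return z / (1.0 + z)


try:
    naive_sigmoid(-1000.0)  # needs exp(1000), which does not fit in a float
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

stable_value = stable_sigmoid(-1000.0)  # exp(-1000) underflows harmlessly to 0.0
```

In both branches of the stable version the exponential is only ever evaluated on a non-positive argument, so it stays between 0 and 1.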

2. Check our Derivatives

Now that we have Python code for our forward and backward passes, we can test if they are coherent with each other. In other words, we will use gradcheck from PyTorch to run a forward pass, compute numerically what the derivative should be, and check our backward pass result. But first, we must wrap our code in the torch.autograd.Function format. It is not that complicated, and looks like this:

class Sigmoid(torch.autograd.Function):
    """The Sigmoid activation function."""

    @staticmethod
    def forward(ctx, input: torch.Tensor) -> torch.Tensor:
        """Performs a forward pass."""

        out_tensor = torch.empty_like(input)
        positive_mask = input >= 0
        out_tensor[positive_mask] = 1. / (1. + torch.exp(-input[positive_mask]))
        out_tensor[~positive_mask] = torch.exp(input[~positive_mask]) / (1. + torch.exp(input[~positive_mask]))
        
        ctx.save_for_backward(out_tensor)

        return out_tensor

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor) -> torch.Tensor:
        """Performs a backpropagation."""

        (result,) = ctx.saved_tensors
        grad = result * (1 - result)
        return grad_output * grad

Notice that both on forward and backward pass we are dealing with an additional variable: ctx. This is our context, that we can use to save some data on our forward pass to use on backward. This is quite handy for our Sigmoid since the backward pass is a simple formula that uses the forward pass result, so we save it on the context for our backpropagation.

Finally, on the backward pass, we get the sigmoid result that we stored on ctx and use it to compute its derivative. But we have another input, grad_output, which is the gradient being propagated back to our layer. So, by the chain rule, our final gradient is grad_output multiplied by our sigmoid derivative.

With this in hand, we can call the following function to check if everything is correct:

sigmoid = Sigmoid.apply
data = torch.randn(4, dtype=torch.double, requires_grad=True)

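# raise_exception=False makes gradcheck return False instead of raising on a mismatch
if torch.autograd.gradcheck(sigmoid, data, eps=6e-4, atol=1e-7, raise_exception=False):
    print('gradcheck successful :D')
else:
    print('gradcheck unsuccessful :(')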

If everything is correct we are ready to think about how to implement it in CUDA, otherwise, we can back up and check what we did wrong.
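Conceptually, gradcheck compares the analytical gradient against a finite-difference estimate. The following is a simplified scalar sketch of that idea in plain Python, not PyTorch's actual implementation; the helper names are assumptions for this example.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)  # analytical derivative, as in our backward pass


def numerical_grad(f, x: float, eps: float = 1e-6) -> float:
    # Central finite difference: (f(x + eps) - f(x - eps)) / (2 * eps)
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


x = 0.3
difference = abs(numerical_grad(sigmoid, x) - sigmoid_grad(x))
```

If the backward formula had a bug, for instance a missing factor, the difference would be far above any reasonable tolerance.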

3. CUDA Implementation using Numba

(I will not dive into all the details on how CUDA works, but I suggest you check this video by Jeremy Howard to see a great explanation about it!)

The first thing we need to do is decide how we are going to model this in CUDA. I believe the most sensible approach is to use a single thread for each element on the input Tensor, for both forward and backward passes. And to finally implement it, we can use the Numba library, which is a JIT compiler for Python with support for CUDA, SIMD, and even threading. But for our case, we are more interested in the CUDA dev environment, especially the CUDA simulator.

To start, the first thing we must do is set NUMBA_ENABLE_CUDASIM='1' as an environment variable before we import Numba. Then we just need to add the @cuda.jit decorator on top of our CUDA kernel function and we are good to go!

Let’s start with the following code for both the forward and backward passes:

import math

from numba import cuda
import torch

@cuda.jit
def sigmoid_forward(input, input_len, out):
    cbi,cbd,tid = cuda.blockIdx,cuda.blockDim,cuda.threadIdx
    idx = cbi.x * cbd.x + tid.x

    if idx >= input_len:
        return
    
    if input[idx] >= 0:
        res = 1. / ( 1. + math.exp(-input[idx]) )
    else:
        res = math.exp(input[idx]) / ( 1. + math.exp(input[idx]) )

    out[idx] = res

@cuda.jit
def sigmoid_backward(input, input_len, out):
    cbi,cbd,tid = cuda.blockIdx,cuda.blockDim,cuda.threadIdx
    idx = cbi.x * cbd.x + tid.x

    if idx >= input_len:
        return
    
    out[idx] = input[idx]*(1-input[idx])

There is a lot to unpack here, so let’s start with the first lines. We are accessing cuda.blockIdx and cuda.threadIdx to get our block and thread indexes, and cuda.blockDim to know how many threads we have per block. And since we are using a single thread to compute a single value from our input tensor, we get our final index with

\(idx = B_{index} * B_{size} + T_{index}\),

where \(B_{size}\) is the number of threads per block, \(B_{index}\) and \(T_{index}\) are the block and thread indexes.

Having our current index, we must check if this index is within our input tensor size, and return without doing anything if it is not. Those cases will happen when the total number of threads, i.e. number of thread blocks times block size, is not exactly the same as the input size.
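The index formula and the out-of-range check can be simulated in plain Python, without a GPU. This sketch uses hypothetical helper names and simply enumerates every (block, thread) pair of a launch to show which threads do real work and which ones return early.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division: number of blocks of size b needed to cover a elements."""
    return (a + b - 1) // b


def launch_indices(input_len: int, threads_per_block: int):
    """Enumerate the global index computed by every thread of a 1D launch."""
    blocks = cdiv(input_len, threads_per_block)
    active, idle = [], []
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            idx = block_idx * threads_per_block + thread_idx
            # Same guard as in the kernel: threads past the end do nothing.
            (active if idx < input_len else idle).append(idx)
    return blocks, active, idle


blocks, active, idle = launch_indices(input_len=5, threads_per_block=4)
```

With 5 elements and 4 threads per block we need 2 blocks, that is 8 threads, so the last 3 threads fall outside the tensor and must return without touching memory.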

If everything is correct, we will take the current value from input at location \(idx\) and calculate our Sigmoid. Nothing too fancy here. But we can test with the following code:

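def cdiv(a, b):
    """Ceiling division: number of blocks of size b needed to cover a elements."""
    return (a + b - 1) // b

def sigmoid_numba(input, fun, tw=16, gradcheck=False):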
    (input_len,) = input.shape
    out = torch.zeros(input_len, dtype=torch.float32)
    out = out.contiguous().cuda()
    tpb = tw
    blocks = cdiv(input_len,tpb)
    fun[blocks, tpb](input, input_len, out) 
    return out
    
input = torch.as_tensor([0.3, -100000, 100000, 0.5, -0.5], dtype=torch.float32)
input = input.contiguous().cuda()

res = sigmoid_numba(input, sigmoid_forward, 1)
grad = sigmoid_numba(res, sigmoid_backward, 1)

I’ve created an auxiliary function called sigmoid_numba to encapsulate the important (and boring) code necessary to allocate our output tensor and calculate an appropriate number of threads per block and thread blocks. Those configurations have some upper limits depending on your CUDA GPU, and the optimal value for each also depends on the GPU version. But for now, we are just going with some numbers that somewhat seem right, and in the end, we can run a small benchmark to decide the best values for our particular GPU. And finally, notice that our input tensor is calling two functions: contiguous() and cuda(): contiguous makes sure that our tensor is contiguous in memory since we are accessing it like a single dimensional array; cuda returns a copy of our tensor in CUDA memory.

And that’s it, with this code you are programming a CUDA kernel, but with the big difference that we can use a debugger and step through our code as we wish, and with a much smaller iteration time :). Notice that it is best to set \(B_{size}=1\) when using breakpoints, since debuggers usually don’t work well with multiple threads hitting a breakpoint at the same time.

This Numba CUDA development is way easier, and if we change our env variable to NUMBA_ENABLE_CUDASIM='0', Numba will compile this code to CUDA for us, and we can see the performance that we should get. For some reason, the direct implementation in C CUDA is usually faster, with differences of 2x to be expected, but even then it gives us a good idea of how fast our final implementation should be. Notice, however, that without the CUDA Simulator enabled we lose the ability to debug our code and use numpy/torch functions inside the kernel. You can check out here what is supported.

4. Calling on ChatGPT to Help Us

The Numba development is there to help us, but the final goal is to generate a C CUDA kernel that we can directly call from PyTorch. Fortunately, ChatGPT is plenty capable of doing this! I’ve pasted the following query, followed by the Numba code: “Convert the following python code to C CUDA kernel. Also add a function that uses torch library to pass the input arguments, call the CUDA kernel, and check for errors. The function must receive torch::Tensor as input and return the output as torch::Tensor.”

And it gave me something really close to this:

#include <torch/extension.h>

__global__ void sigmoid_forward_cuda_kernel(const float* input, int input_len, float* out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < input_len) {
        float res;
        if (input[idx] >= 0) {
            res = 1. / (1. + expf(-input[idx]));
        } else {
            res = expf(input[idx]) / (1. + expf(input[idx]));
        }

        out[idx] = res;
    }
}

torch::Tensor sigmoid_forward_cuda(torch::Tensor input) {
    CHECK_INPUT(input);
    // Get the data pointers and sizes
    float* input_data_ptr = input.data_ptr<float>();
    int input_len = input.numel();

    // Allocate output tensor on GPU
    torch::Tensor out_tensor = torch::empty_like(input);

    // Get the data pointer for the output tensor
    float* out_data_ptr = out_tensor.data_ptr<float>();

    // Set block and grid dimensions
    int threads_per_block = 256; // You may adjust this based on your specific GPU capabilities
    int num_blocks = (input_len + threads_per_block - 1) / threads_per_block;

    // Launch CUDA kernel
    sigmoid_forward_cuda_kernel<<<num_blocks, threads_per_block>>>(input_data_ptr, input_len, out_data_ptr);

    // Synchronize to ensure the kernel is done before proceeding
    cudaDeviceSynchronize();
    C10_CUDA_KERNEL_LAUNCH_CHECK();

    return out_tensor;
}

Notice that in the query we’ve explicitly told ChatGPT to accept a torch::Tensor as input and return another one as output. This makes our lives so much easier in the following steps.

The backward pass is quite similar, and you can check it on my repo.

5. Using PyTorch to Compile C CUDA

I really didn’t know that PyTorch could do this, but if you have the dev files for CUDA and the ninja build system installed, you can pass the C CUDA code as a string and it will build it as a Python module for you. So first, to set things up we must have some auxiliary functions (thanks to Jeremy Howard), that you can check out here. The most important bit is a helper function to call load_inline from PyTorch. It enables us to pass C CUDA code as a string and compile it to a Python module containing the kernel as a Python function. It is quite amazing.

So, let’s compile our C CUDA kernel! Here are the steps:

cuda_src = FORWARD_PASS_CUDA_CODE_FROM_CHAT_GPT
fname = 'sigmoid_forward_cuda'
cpp_src = 'torch::Tensor sigmoid_forward_cuda(torch::Tensor input);'

module_forward = load_cuda(cuda_src, cpp_src, [fname])

input = torch.as_tensor([0.3, -100000, 100000, 0.5, -0.5], dtype=torch.float32)
input = input.contiguous().cuda()
res = module_forward.sigmoid_forward_cuda(input)

And that’s it! But first, let’s explain those lines a little bit. First, cuda_src is a Python string containing our code that was so kindly translated for us by ChatGPT. fname is the name of the function that we want to expose in our compiled module, and cpp_src is the C++ code that is compiled along with our CUDA kernel; all it has is the declaration of our function. With all of this, we can finally call our helper load_cuda, defined in our utils.py if you want to check it out, and it returns our new Python module with our sigmoid_forward_cuda function.
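For reference, a helper like load_cuda can be a thin wrapper around torch.utils.cpp_extension.load_inline. The version below is only a sketch of what it could look like: the argument names, the defaults, and the module name inline_ext are assumptions, not necessarily what utils.py contains.

```python
def load_cuda(cuda_src, cpp_src, funcs, opt=False, verbose=False, name="inline_ext"):
    """Compile CUDA/C++ source strings into an importable Python module.

    A sketch of what such a helper could look like; the argument names and
    defaults here are assumptions, not necessarily those in utils.py.
    """
    # Imported lazily so that defining the helper does not need a CUDA toolchain.
    from torch.utils.cpp_extension import load_inline

    return load_inline(
        name=name,
        cpp_sources=[cpp_src],    # C++ declarations of the exposed functions
        cuda_sources=[cuda_src],  # the kernels and their torch wrappers
        functions=funcs,          # names to expose on the Python module
        extra_cuda_cflags=["-O2"] if opt else [],
        verbose=verbose,
    )
```

Actually calling it requires PyTorch, the CUDA toolkit, and ninja to be installed; the first call compiles the extension, which can take a while.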

For the backward pass, it is mostly the same process, as expected. Here it is:

cuda_src = BACKWARD_PASS_CUDA_CODE_FROM_CHAT_GPT
fname = 'sigmoid_backward_cuda'
cpp_src = 'torch::Tensor sigmoid_backward_cuda(torch::Tensor input);'

module_backward = load_cuda(cuda_src, cpp_src, [fname])

grad = module_backward.sigmoid_backward_cuda(res)

6. Check our Gradients Again

Great, we have both our forward and backward passes implemented in CUDA! However, are they correct? Did ChatGPT make some silly mistake in the translation? Well, at least we can check if the backward pass is indeed the correct derivative of the forward pass. Just like we did in step 2, we must call gradcheck. And to do this, first, we must adhere to the autograd interface, like this:

class CUDASigmoid(torch.autograd.Function):
    @staticmethod
    def forward(ctx, data: torch.Tensor) -> torch.Tensor:
        result = module_forward.sigmoid_forward_cuda(data)
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor) -> torch.Tensor:
        (result,) = ctx.saved_tensors
        grad = module_backward.sigmoid_backward_cuda(result)
        return grad_output * grad

Not that bad, if you ask me, and not that different from the one from step 2. And now, for the finale:

sigmoid = CUDASigmoid.apply
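# Create the tensor directly on the GPU so that it stays a leaf that requires grad
data = torch.randn(4, dtype=torch.float32, device='cuda', requires_grad=True)

# Changing eps and atol since we are dealing with float32
# raise_exception=False makes gradcheck return False instead of raising on a mismatch
if torch.autograd.gradcheck(sigmoid, data, eps=5e-4, atol=1e-7, raise_exception=False):
    print('gradcheck successful :D')
else:
    print('gradcheck unsuccessful :(')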

You may have noticed a small, but very important change here: we are using float32 instead of double. Our CUDA implementation only deals with float32, so we can’t test with float64. However, this presents some challenges for our gradcheck, since floating point errors are way more present, and we end up having to change our eps to a higher value. I’ve tested with our vanilla PyTorch implementation to get a “correct” value for it, and then plugged it back in here. This is not a good practice, but in order to keep our CUDA code simpler I’ve avoided supporting types other than float32.
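To get a feeling for why float32 forces a larger eps, here is a small NumPy experiment (an illustration only, independent of gradcheck's internals). It measures the error of a central finite difference of the sigmoid at a single, arbitrarily chosen point, computed in each precision.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def central_difference_error(dtype, eps, x0=0.3):
    """Absolute error of a finite-difference sigmoid gradient computed in dtype."""
    x = np.array([x0], dtype=dtype)
    h = np.array([eps], dtype=dtype)
    numerical = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    s = sigmoid(np.float64(x0))
    exact = s * (1 - s)  # analytical derivative in double precision
    return abs(float(numerical[0]) - float(exact))


err64 = central_difference_error(np.float64, eps=1e-6)
err32_small_eps = central_difference_error(np.float32, eps=1e-6)
err32_large_eps = central_difference_error(np.float32, eps=5e-4)
```

In float32 the tiny difference between the two sigmoid values is drowned in rounding noise when eps is very small, so a larger eps gives a much better estimate, while float64 handles the small eps without trouble.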

With those caveats aside, our gradcheck should be passing and we are officially golden, our CUDA Sigmoid implementation is over!

Conclusions

Wow, that was a long post. However, I tried to skim over only the non-critical details and explain the development pipeline in greater detail. That is the key point that you should be taking from here: how to make CUDA development less sucky. And by using this PyTorch feature to compile CUDA code, we can even run CUDA kernels on Google Colab! You can check my Jupyter Notebook here and give it a try!

I do believe that this is interesting knowledge to have, and in this day and age of huge LLMs, being able to tackle some performance bottlenecks can have a great impact, as mentioned in my introduction. In the end, it somewhat boils down to what my older sister always told me: “Knowledge doesn’t occupy space” :)

100x speedup on Python with just a touch of C++

2022-05-13T15:00:00+00:00

Python is a great language. I still remember my first contact with Python2 some 8 years ago, and I was amazed by how clean and expressive it was. And now, with Python3, a lot has changed. It is now the de facto language for machine learning (so long, Matlab!), and lots of amazing stuff has been built with it.

All is good and dandy, however from time to time I’ve encountered a brick wall when working on Python: how slow it is. Don’t get me wrong, if you are using libs to do your heavy processing, such as NumPy, you are good to go. But it’s important to notice that the core of NumPy is not Python, and for a reason. It’s just not the language for that.

For most cases, you can use such libs and hand the number-crunching work to them, but sometimes you want something less conventional that does not fit within their limitations. And then you end up writing two nested for loops in Python, processing a Full HD image, and you want to cry…

Fortunately, we can write those hot spots in C++, and it is surprisingly simple to do so and integrate them seamlessly with Python. However, this opens another can of worms: C++ itself, with its dependencies and compatibility issues. Anyone who has had to target Linux and Windows, in both 32 and 64 bits, knows what I’m talking about. So for me it is of the utmost importance that it can be used seamlessly on any platform without any dependencies other than a C++ compiler.

So upfront I’m already discarding Boost.Python and PyBind11. I’ve used both, and usually prefer PyBind11 since it is much easier to manage on different platforms. But one dependency is one too many. And as I will show now, you don’t need them for most cases.

Let’s start with a very simple and naive example: normalize the contrast of a black and white image.

import numpy as np

def naive_contrast_image(image):
    result = np.zeros(image.shape, dtype=np.uint8)
    min_color, max_color = np.min(image), np.max(image)
    delta_color = max_color-min_color
    for row in range(image.shape[0]):
        for col in range(image.shape[1]):
            # Cast to float so that 255*(pixel-min_color) cannot overflow uint8
            pixel = float(image[row,col])
            result[row,col] = 255*(pixel-min_color)/delta_color

    return result

So this code generates the following result:

This is a very simple and naive example that could (and should) be done using NumPy. But let us do this in C++.
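For reference, a vectorized NumPy version could look like the sketch below. The function name and the handling of a constant image (where the color range is zero) are choices made for this example, not part of the original code.

```python
import numpy as np


def numpy_contrast_image(image):
    """Stretch the pixel values of a grayscale image to the full 0..255 range."""
    as_float = image.astype(np.float32)
    min_color, max_color = as_float.min(), as_float.max()
    delta_color = max_color - min_color
    if delta_color == 0:
        # Constant image: nothing to stretch, and we must not divide by zero.
        return np.zeros(image.shape, dtype=np.uint8)
    return (255 * (as_float - min_color) / delta_color).astype(np.uint8)


demo = np.array([[10, 20], [30, 110]], dtype=np.uint8)
stretched = numpy_contrast_image(demo)
```

The darkest pixel maps to 0 and the brightest to 255, with everything in between scaled linearly and truncated to an integer.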

The first difference in C++ is that you must specify the variable types. So let us define image as an np.uint8 array, and the resulting image with the same type. In C++ this can be represented as unsigned char. Let’s take a look at our implementation. In contrast_image.h:

#include <algorithm>
#include <vector>

extern "C" {

void cpp_contrast_image(const unsigned char *image, int height, int width, unsigned char *outResult);

} // extern "C"

And contrast_image.cpp:

#include "contrast_image.h"

void cpp_contrast_image(const unsigned char *image, int height, int width, unsigned char *outResult) {
    auto vec = std::vector<unsigned char>(image, image+width*height);
    auto minmax = std::minmax_element(vec.begin(), vec.end());
    float min = (float)*minmax.first;
    float max = (float)*minmax.second;
    float delta_color = max-min;
    for (int row=0; row<height; row++) {
        for (int col=0; col<width; col++) {
            int idx = row*width + col;
            float pixel = (float)image[idx];
            outResult[idx] = (unsigned char)(255*(pixel-min)/delta_color);
        }
    }
}

There are some small but very important details here, so let’s start with the important ones.

Avoid dynamic memory allocation on C++. Python Garbage Collector will not see them so you will have to free them by yourself. Prefer to allocate the memory with NumPy. This will be shown further along.
Multiple dimensional arrays are actually just a single array with some syntactic sugar to access it. You’ll notice the direct idx calculation on the example. It is a good practice to create a function to give you the index given the desired position to avoid silly bugs.
Accessing and/or modifying an invalid array position will generate the dreaded Segmentation Fault. So always be diligent with the range checks.
The function must have a C compatible interface, as we can see with the extern "C" in contrast_image.h. Usually this is not a big deal since we can use all the desired C++ stuff within the implementation in contrast_image.cpp; however, we will have to implement different versions for different input types, since templates are not available in the function declaration :(.
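The index helper suggested in the list above is tiny. It is sketched here in Python for brevity (the name flat_index is made up for this example), and the same logic carries over directly to a small C++ function.

```python
def flat_index(row: int, col: int, height: int, width: int) -> int:
    """Row-major position of (row, col) in a flattened height x width image."""
    if not (0 <= row < height and 0 <= col < width):
        # In C++ an unchecked access here is where the segfault would come from.
        raise IndexError(f"({row}, {col}) is outside a {height}x{width} image")
    return row * width + col


position = flat_index(row=1, col=2, height=3, width=4)
```

Keeping the formula and the range check in a single place means a mistake such as swapping width and height only has to be fixed once.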

Finally, returning complex objects within a C interface is not the easiest and cleanest thing to do. So for the most part I just reserve my final arguments to return my value. And also, use const on every array that you should not change and let the compiler help you find bugs.

Ok, we have a C++ code that does exactly what we want and can compile it to a lib with:

g++ -Wall -O2 -c -fPIC contrast_image.cpp

g++ contrast_image.o -shared -o libcontrast_image.so

Until now I did not say anything out of the ordinary, but we are surprisingly close to finishing it. Python has a useful and easy way to access C compiled libs: ctypes. So this is how we will use our cpp_contrast_image in Python:

import ctypes
import numpy as np
from numpy.ctypeslib import ndpointer

lib = ctypes.CDLL('./libcontrast_image.so')

c_contrast_image = lib.cpp_contrast_image
c_contrast_image.argtypes = [
    ndpointer(ctypes.c_ubyte, flags='C_CONTIGUOUS'),
    ctypes.c_int,
    ctypes.c_int,
    ndpointer(ctypes.c_ubyte, flags='C_CONTIGUOUS'),
]

def contrast_image(image):
    result = np.zeros(image.shape, dtype=np.uint8)
    c_contrast_image(image, image.shape[0], image.shape[1], result)
    return result

And that’s it! You can use the new contrast_image python function with exactly the same interface, but much faster! How fast, you may ask. Well, on my i7 8550-U it went from 1229.050ms to 1.645ms on this demo image. Quite a difference! That’s actually over 700x faster, way over the promised 100x. The reason is that in our use cases we often see a speedup of a little over 100 times, so I’m trying to not over-promise here.

Just as with our C++ code, we have some important stuff to notice here. So let’s do it:

In C++ we treated our NumPy arrays as a single contiguous array. Usually that is the case, but not always! Fortunately we can make this constraint explicit on the Python side, declaring that our NumPy array is of type unsigned char and must be contiguous. If you call it with the wrong type an exception will be raised, saving you from a possible Segmentation Fault. You can check the available ctypes types here.
Remember that we are avoiding allocating memory in the C++ code? So we do it here, by explicitly allocating the result image with np.zeros.
We have to explicitly point to where our compiled C++ library is to be loaded from, using ctypes.CDLL.
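The type and contiguity checks from the first point can be tried without any compiled library, because the class returned by ndpointer validates arrays through its from_param method, which ctypes calls for every argument. The helper is_accepted below is a hypothetical name used only for this sketch.

```python
import ctypes

import numpy as np
from numpy.ctypeslib import ndpointer

uint8_image = ndpointer(ctypes.c_ubyte, flags='C_CONTIGUOUS')


def is_accepted(array) -> bool:
    """True if ctypes would accept `array` for an argument of this type."""
    try:
        uint8_image.from_param(array)
        return True
    except TypeError:
        return False


image = np.zeros((4, 6), dtype=np.uint8)
transposed = image.T                      # a view, no longer C-contiguous
as_float = image.astype(np.float32)       # wrong element type
fixed = np.ascontiguousarray(transposed)  # contiguous copy of the view
```

Slices and transposes are the usual sources of non-contiguous arrays, and np.ascontiguousarray is a simple way to get a copy that the C++ side can safely treat as a flat buffer.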

That’s it! Within a few lines of code you have lots of freedom to easily integrate C++ code into Python, and all of that without any dependency :)

You may be thinking that this is a silly example. And you are right. But you can do lots of stuff with this knowledge. For example, we decreased the runtime of a rasterization algorithm from 2.5s to 1.8ms, quite a hefty difference! You can read all about that in an upcoming post. But I’ll warn you, it was really easy :)

Finally, I must quote a great thinker: “With great power comes great responsibility”. For an untrained person, dabbling with pointers in C++ is a quick road to memory leaks and Segmentation Faults. Actually, even for trained ones. So it is a good practice to keep that code as short as possible, usually not replacing a whole function but just the slow parts. And don’t forget to write lots of unit tests to catch unusual edge cases. But if you are willing to deal with those drawbacks, a whole new world of crazy fast code awaits you!

PS.: All of this code and the benchmark script can be seen on https://github.com/gfickel/python_cpp. It is meant to only illustrate the interface between C++ and Python, so everything surrounding it is not production ready. This is up to the reader ;)

PS2.: Thanks to Michele Tanus, Gustavo Führ and Roger Granada for proofreading and greatly improving this post.