Scalability is probably one of the most used buzzwords by software engineers in the last decade. It is a commonly accepted fact that a core trait of modern software is its ability to scale to varying loads rapidly. But how should one go about actually building scalable software without making their application overly complex and hard to understand? One of the options that has gained a lot of traction recently is called “serverless computing”. In this article, I outline how we used the serverless computing capabilities offered by Amazon Web Services (AWS) at Modulize to migrate a monolithic part of our application - that created a major bottleneck - to a lightning-fast service on AWS that can scale almost infinitely.
Modulize allows users to do material takeoff and cost estimation based on building floorplans in PDF format. In order to provide our customers with a fast and feature-rich experience, each PDF passes through a pipeline for preprocessing & data extraction steps before it can be used in the app. This pipeline roughly looks like the illustration below.
After uploading a PDF, we first extract relevant metadata such as the size and orientation of each page. Since building plans often contain large amounts of text unnecessary for takeoff, we then generate a version of the PDF without any text. Then, we convert both versions of the PDF to images that can easily be displayed and manipulated in any browser. Finally, we extract relevant points in the building plan using computer vision.
Our MVP version of this pipeline was implemented via a worker process running in AWS EC2. Whenever a file was uploaded, it would be stored on S3 and its path would be pushed to a Redis queue. The worker process would listen to messages in this queue, pop the first one, and start processing it.
Since all PDFs in a building project have to pass through this pipeline first, it has to be as fast, reliable & scalable as possible. Using our initial approach, we could have scaled up by spawning more worker processes that would listen to the queue and pull files from it. However, each new worker process needs a runtime environment (e.g., an EC2 instance) that can support its need for memory & compute - especially when processing complex files. If the runtime environment is not powerful enough, the worker runs out of memory and crashes. Therefore, we would have had to reserve (and pay for!) many large EC2 instances even though we would only need their full capacity during load spikes. Of course, there are solutions like Amazon EC2 Auto Scaling that allow you to adjust dynamically to load spikes. However, they would still require us to handle the complexity of how & when to scale ourselves.
The issue of wasting resources is further amplified by the monolithic nature of our initial PDF processing approach: Every worker is a single process that goes through all of the stages of the pipeline. Therefore, the runtime environment needs to provide sufficient resources to support the most resource-hungry steps of the pipeline, even if most of the tasks could have been done with a fraction of the memory & compute.
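To make the monolithic shape concrete, here is a minimal Python sketch of such a worker. It is an illustration under assumptions, not our production code: the stage functions are hypothetical placeholders, and an in-memory `queue.Queue` stands in for the Redis queue so the example runs without any infrastructure.

```python
import queue

# Hypothetical stage functions. The real stages would call PDF and
# computer-vision libraries and differ widely in their memory needs.
def extract_metadata(path):
    return {"path": path}

def remove_text(path):
    return path.replace(".pdf", "-notext.pdf")

def generate_images(paths):
    return [p.replace(".pdf", ".png") for p in paths]

def extract_points(images):
    return {image: [] for image in images}

def process_file(path):
    """Run every stage in one process, as a monolithic worker does."""
    metadata = extract_metadata(path)
    textless = remove_text(path)
    images = generate_images([path, textless])
    points = extract_points(images)
    return {"metadata": metadata, "images": images, "points": points}

def run_worker(jobs):
    """Pop file paths until the queue is empty and process them one by one."""
    results = []
    while True:
        try:
            path = jobs.get_nowait()
        except queue.Empty:
            return results
        results.append(process_file(path))

jobs = queue.Queue()  # stand-in for the Redis queue (assumption)
jobs.put("plans/ground-floor.pdf")
results = run_worker(jobs)
```

Because `process_file` runs all four stages in the same process, the machine hosting the worker has to be sized for the hungriest stage, even while it is executing the cheap ones.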
AWS Lambda implements the recently popularised concept of “serverless computing”. Even though the name seems to imply there are no servers at all, we still have to rely on something to actually carry out the computations. So while there are still servers in “serverless computing”, we are no longer concerned with setting them up or managing them. Instead, we just provide the cloud provider with a function that should be executed. It is then the responsibility of the cloud provider to execute this function in a suitable environment upon invocation.
This paradigm of computation has several advantages:
• Deployment complexity is reduced dramatically since all of the resources are managed by the cloud provider.
• Increased resource efficiency & lower cost due to the pay-as-you-go payment model.
• Natural decomposition of software into smaller functions.
• Basically infinite (i.e. limited by the capacity of the cloud provider) scalability of each function.
If we go back to the PDF processing pipeline outlined above, we immediately recognise that each part of it can be naturally mapped to an individual Lambda function. Since AWS manages allocation of the necessary resources for each function invocation, we no longer have to care about scaling the pipeline ourselves. Instead, we just invoke the functions as often as we need and let AWS take care of the rest!
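As a rough illustration, one pipeline stage written as a Python Lambda function could look like the sketch below. The `handler(event, context)` signature is the standard entry point for Python functions on Lambda. Everything else is an assumption made for the example: the shape of the event, the helper `extract_metadata`, and the placeholder page values it returns.

```python
def extract_metadata(bucket, key):
    # Hypothetical helper: a real implementation would download the PDF
    # from S3 and read the size and orientation of each page.
    # The values below are placeholders, not real measurements.
    return [{"page": 1, "width": 595, "height": 842, "orientation": "portrait"}]

def handler(event, context):
    """Entry point that AWS Lambda calls with the invocation payload."""
    pages = extract_metadata(event["bucket"], event["key"])
    # The returned dict is the function's output and can be passed on
    # as the input of the next stage.
    return {"bucket": event["bucket"], "key": event["key"], "pages": pages}

# Local invocation with a made-up event; Lambda supplies a real context object.
output = handler({"bucket": "example-bucket", "key": "plans/ground-floor.pdf"}, None)
```

Each stage only sees its input event and returns plain data, which is what makes it possible to size and scale every function independently.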
AWS Step Functions
By separating our monolithic pipeline into smaller functions that are executed by AWS Lambda, we have already solved many of our previous issues including scalability and resource efficiency. However, up to this point we have not covered a very important aspect of the PDF processing pipeline: Since it lies in the nature of a pipeline that some tasks have to be executed in sequence, we need a way to orchestrate all of the Lambda functions. In essence, this means invoking a Lambda function with some input, taking its output and invoking another Lambda function with it. However, we also want to be able to model more complex control flow scenarios such as choosing a different function to invoke next depending on the output of the previous one. Additionally, we have to take into account that some of the invocations might fail due to a variety of reasons. In these cases, we usually want to retry the operation, execute a different Lambda function or terminate the whole pipeline.
The orchestration service that allows us to glue all of our individual Lambda functions together is called AWS Step Functions. It enables us to piece together the control-flow and execution graph in a very simple fashion and provides integration with many AWS services. The following figure illustrates the control flow of the “Generate Images” step similar to the setup in AWS Step Functions:
In this diagram, only the “Generate Backgrounds” step is an actual Lambda function, while all of the remaining parts are control flow actions provided by Step Functions. The first function, “Generate Backgrounds (small)”, is an instantiation of the Lambda function with a very limited amount of computational resources. We use this scaled-down version initially since its resources are enough to process 99% of the PDFs we handle. If processing fails, however, we use the error recovery capabilities of Step Functions to try to resolve the error gracefully. Generally, processing of the page can fail with three different kinds of errors in this step:
• Retryable Error: Errors that we can expect to recover from by just trying the same operation again (e.g. a failed download from S3).
• Insufficient Resources: Processing failed because the function ran out of memory or timed out due to insufficient computing power. In this case we can just use an instantiation of the Lambda function with more resources, i.e. “Generate Backgrounds (large)”.
• General Error: Any unexpected error that we do not know how to handle. In this case, we report the failure and terminate the execution of the step function.
If execution of the first Lambda function succeeds, we just continue on to the next step in the pipeline. In the case of an error, execution branches according to the possibilities laid out above.
Note that all of the error-handling and retry logic is now external to the function itself. This separates concerns and lets us focus on domain logic in the code of the function.
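The branching described above can be sketched as a state machine definition in the Amazon States Language, built here as a Python dict and serialised to JSON. This is a simplified sketch, not our actual definition. The state names, the placeholder ARNs, the retry settings, and the custom error names `RetryableError` and `InsufficientResourcesError` are all assumptions for illustration; how an out-of-memory failure is actually reported depends on the Lambda integration. `States.Timeout` and `States.ALL` are built-in error names.

```python
import json

# Placeholder ARNs (assumption): replace REGION and ACCOUNT_ID with real values.
SMALL_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:generate-backgrounds-small"
LARGE_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:generate-backgrounds-large"

definition = {
    "StartAt": "GenerateBackgroundsSmall",
    "States": {
        "GenerateBackgroundsSmall": {
            "Type": "Task",
            "Resource": SMALL_ARN,
            # Retryable Error: try the same function again with backoff.
            "Retry": [{
                "ErrorEquals": ["RetryableError"],  # hypothetical error name
                "IntervalSeconds": 2,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
            "Catch": [
                # Insufficient Resources: fall back to the larger function.
                {"ErrorEquals": ["InsufficientResourcesError", "States.Timeout"],
                 "Next": "GenerateBackgroundsLarge"},
                # General Error: anything else ends the execution.
                {"ErrorEquals": ["States.ALL"], "Next": "ReportFailure"},
            ],
            "Next": "Done",
        },
        "GenerateBackgroundsLarge": {
            "Type": "Task",
            "Resource": LARGE_ARN,
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "ReportFailure"}],
            "Next": "Done",
        },
        "ReportFailure": {
            "Type": "Fail",
            "Error": "GeneralError",
            "Cause": "Background generation failed",
        },
        "Done": {"Type": "Succeed"},
    },
}

definition_json = json.dumps(definition, indent=2)
print(definition_json)
```

Note that neither Lambda function appears with any retry or fallback code of its own: the `Retry` and `Catch` blocks carry that logic, and the catch-all `States.ALL` entry has to come last in its list.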
Obviously, it took quite a bit of work to move from our old, monolithic version of the PDF processing pipeline to the serverless version that relies on AWS. Therefore, the most important question is: Was it worth it to put in that effort?
After working with the serverless pipeline for quite a while now, I think that I can confidently say yes, it was definitely worth it. The most obvious and immediate advantage of moving our preprocessing to AWS is the fact that it became nearly infinitely scalable without us needing to put any thought into how to handle scaling and provisioning of compute resources. Another positive aspect was the reduction in compute cost: Since the pipeline is now executed in AWS Lambda, we were able to reduce the size & number of EC2 instances that were previously used to run it. This enabled us to cut the cost for computational resources dramatically for two reasons:
1. We no longer had to keep excess capacity to be prepared for load spikes.
2. Lambda function executions are really cheap.
Apart from these relatively apparent advantages, there are also more subtle ones. Splitting the individual steps of the pipeline into separate functions also forced us to clean up the code to fit this new structure. This means reducing coupling between functions and increasing cohesiveness inside of a single function and ultimately results in a cleaner code base that is easier to extend and maintain.
In summary, I can only recommend having a look at the combination of AWS Lambda & AWS Step Functions if you also want your app to scale to infinity!