Automating malware scanning on uploaded files

I recently helped integrate some security requirements on a project. One of those requirements was that every file uploaded into the system must be scanned for malware and, if the scan comes back positive, made unavailable to end users.

For an initial version, we intentionally kept things simple and approached it pragmatically based on our existing infrastructure and its limitations. There are definitely ways that this setup could be improved, which I'll cover at the end of the article.

Setting up the infrastructure

Before we dive in, a bit of background about our infrastructure: we use DigitalOcean as our cloud provider, along with several of its managed services such as Managed Databases, Spaces and App Platform.

We decided to use ClamAV, a popular open-source antivirus engine, to handle the actual scanning. It ships with a database of known viruses, worms, trojans and other malware that is automatically kept up to date, and it provides several ways to scan files.

ClamAV's daemon (clamd) accepts scan requests in two primary ways:

A Unix socket, for scanning a file that lives on the same machine as ClamAV.
A TCP socket, for scanning a file over a network connection.
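
Whichever socket you pick, the conversation with clamd looks the same. As a rough sketch (not code from our project; in practice a client library handles this for you), here is how a file's contents could be framed for clamd's INSTREAM command and how the reply could be interpreted. The function names frameInstream and parseClamdReply and the returned array shape are made up for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Frame a payload for clamd's INSTREAM command: the command itself,
 * then length-prefixed chunks, then a zero-length chunk that ends the stream.
 * (Hypothetical helper, for illustration only.)
 */
function frameInstream(string $data, int $chunkSize = 8192): string
{
    if ($chunkSize < 1) {
        throw new InvalidArgumentException('Chunk size must be positive.');
    }

    // The "z" prefix asks for a null-terminated command and reply.
    $frame = "zINSTREAM\0";

    if ($data !== '') {
        foreach (str_split($data, $chunkSize) as $chunk) {
            // Each chunk is prefixed with its size as a 4-byte big-endian integer.
            $frame .= pack('N', strlen($chunk)) . $chunk;
        }
    }

    // A zero-length chunk tells clamd the stream is complete.
    return $frame . pack('N', 0);
}

/**
 * Interpret a clamd reply such as "stream: OK" or "stream: <name> FOUND".
 *
 * @return array{status: string, reason: ?string}
 */
function parseClamdReply(string $reply): array
{
    // trim() also strips the trailing null byte of a "z" reply.
    $reply = trim($reply);

    if (str_ends_with($reply, ' ERROR')) {
        return ['status' => 'error', 'reason' => substr($reply, 0, -6)];
    }

    if (preg_match('/^stream: (.+) FOUND$/', $reply, $matches) === 1) {
        return ['status' => 'found', 'reason' => $matches[1]];
    }

    if ($reply === 'stream: OK') {
        return ['status' => 'ok', 'reason' => null];
    }

    return ['status' => 'error', 'reason' => 'Unexpected reply: ' . $reply];
}
```

The framed bytes would be written to the socket and the reply read back until the terminating null byte. Streaming the content this way is what lets the scanner live on a different machine than the file.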

While exploring how we wanted to set this up, we quickly realized that running ClamAV as part of our app was not an option, for a couple of reasons:

We would need to create a custom Docker image containing ClamAV and our own app and deploy it to our App Platform service.
ClamAV loads its signature database into memory, which requires at least 3-4 GB. An App Platform container of that size would cost around €40/mo (for each environment).

Instead, we opted to do the scanning over the TCP socket. We set up a €5/mo droplet, added some swap space, and installed ClamAV following its installation instructions. We could then re-use this internal service across all our environments, like a real micro-service!

Some limitations

However, we did run into some limitations of App Platform, mainly related to networking. These two in particular:

App Platform applications do not have persistent IP addresses.
Apps deployed on App Platform are not connected to VPC networks. All connections from apps to other services running on DigitalOcean occur over the public network, including connections between apps and DigitalOcean Managed Databases. See How to Manage Databases in App Platform for detailed instructions about how to connect apps to databases.

This meant that we couldn't let our App Platform apps communicate with the ClamAV service over the private network, and that we couldn't restrict access to the ClamAV port to just those apps (an IP allowlist needs stable addresses). This essentially makes the service available on the public internet, which sadly is a trade-off we had to make.

Nonetheless, at this point, we had our microservice running, and we could start scanning files.

Doing the scanning

The next step was to scan each file as it was uploaded to our PHP application.

We wrote a small service using Quahog, a client that talks to ClamAV over a TCP (or Unix) socket. The service takes a file path and scans the resource stream found at that path. It's elegant and fast.

declare(strict_types=1);

namespace App\Infrastructure\Security\VirusScan;

use Illuminate\Filesystem\FilesystemManager;
use Socket\Raw\Factory;
use Xenolope\Quahog\Client;

final readonly class ClamAVScanner implements Scanner
{
    public function __construct(
        private FilesystemManager $filesystemManager,
        private string $socket,
    ) {
    }

    public function scan(string $path): ScanResult
    {
        // Read the file from the default disk as a stream.
        $resource = $this->filesystemManager->disk()->readStream($path);

        if ($resource === null) {
            return ScanResult::notFound();
        }

        // Send the stream to ClamAV over the configured socket.
        $result = $this->createClient()->scanResourceStream($resource);

        if ($result->isError()) {
            return ScanResult::error($result->getReason() ?? 'Unknown');
        }

        if ($result->isFound()) {
            return ScanResult::infected($result->getReason() ?? 'Unknown');
        }

        return ScanResult::ok();
    }

    private function createClient(): Client
    {
        return new Client(
            socket: (new Factory())->createClient($this->socket),
            mode: PHP_NORMAL_READ,
        );
    }
}
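
The Scanner interface and the ScanResult value object are not shown above. As a minimal sketch of what they could look like, inferred only from how they are used in this article: the ScanStatus enum, the isInfected() method and the decision to count a missing file as an error are assumptions, not our actual implementation.

```php
<?php
declare(strict_types=1);

// Hypothetical enum; the article only shows the named constructors below.
enum ScanStatus: string
{
    case Ok = 'ok';
    case Infected = 'infected';
    case Error = 'error';
    case NotFound = 'not_found';
}

final readonly class ScanResult
{
    private function __construct(
        public ScanStatus $status,
        public ?string $reason = null,
    ) {
    }

    public static function ok(): self
    {
        return new self(ScanStatus::Ok);
    }

    public static function infected(string $reason): self
    {
        return new self(ScanStatus::Infected, $reason);
    }

    public static function error(string $reason): self
    {
        return new self(ScanStatus::Error, $reason);
    }

    public static function notFound(): self
    {
        return new self(ScanStatus::NotFound, 'File not found');
    }

    public function isOk(): bool
    {
        return $this->status === ScanStatus::Ok;
    }

    public function isInfected(): bool
    {
        return $this->status === ScanStatus::Infected;
    }

    // Assumption: a missing file counts as an error, so callers that only
    // check isOk() and isError() never mistake it for an infection.
    public function isError(): bool
    {
        return $this->status === ScanStatus::Error
            || $this->status === ScanStatus::NotFound;
    }
}

interface Scanner
{
    public function scan(string $path): ScanResult;
}
```

Keeping the result behind an interface like this means the rest of the application never has to know that ClamAV or Quahog is involved, which also makes it easy to swap in a fake scanner in tests.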

In our domain code, a DocumentWasUploaded event would be dispatched whenever a file was uploaded, which in turn would queue up a ScanDocument command that handles the scanning and quarantining if needed:

use Carbon\CarbonImmutable;

final readonly class ScanDocumentHandler
{
    public function __construct(
        private Scanner $scanner,
        private DocumentStorage $storage,
    ) {
    }

    public function handle(ScanDocument $command): void
    {
        $document = $command->document;
        $result = $this->scanner->scan($document->stored_url);

        // A clean file needs no further action.
        if ($result->isOk()) {
            return;
        }

        // A failed scan throws instead of passing silently.
        if ($result->isError()) {
            throw CanNotScanDocument::because($result->reason ?? 'Unknown error');
        }

        // Any other result is treated as an infection: report it and quarantine the file.
        report(new DocumentIsInfected($document, $result->reason ?? 'Unknown'));
        $this->storage->moveIntoQuarantine($document);
        $document->quarantined_at = CarbonImmutable::now();
        $document->save();
    }
}

Evolving our infrastructure

As I said at the start, we chose a pragmatic and easy solution and could build this version in roughly a day. However, I would like to take a moment to consider alternative approaches and how our setup could be improved. It's also just good fun to think about architecture in this way.

Google has written an entire guide on how to set up such a malware-scanning pipeline on Google Cloud, using several of their products such as Cloud Run functions and Cloud Storage.

Keep in mind that they are selling you their cloud infrastructure with all the bells and whistles. Still, there are some things worth noting.

Malware scanning should primarily be an infrastructural concern

In our solution, our application decides when a scan should occur, which means that when new methods of uploading files are introduced, you must also write code for scanning and handling those.

The beauty of cloud platforms like Google Cloud and AWS is that all of the services can interact with each other and make everything event-driven, allowing you to handle tasks through functions and lambdas. On Google Cloud this can be done through Eventarc, and on AWS through SNS. Sadly, DigitalOcean seems to have no such service (yet), so you do need to handle all of this yourself.

It's very likely that your application will still need to know about a file being infected and take appropriate action. Maybe you need to remove a database record? Or let your users know? One option is to send a webhook from the malware scanner service to the application or, for bigger applications, to rely on a message bus.
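
To make the webhook option a bit more concrete, here is a minimal sketch of how the application could check that an incoming notification really comes from the scanner service, using a shared secret and an HMAC signature. Everything here is hypothetical: the payload fields (path, status, reason), the function names and the choice of HMAC-SHA256 are assumptions for illustration, not something we have built.

```php
<?php
declare(strict_types=1);

// Compute the signature the scanner service would send along with the body.
function signPayload(string $body, string $secret): string
{
    return hash_hmac('sha256', $body, $secret);
}

// Compare signatures in constant time to avoid leaking timing information.
function verifyWebhook(string $body, string $signature, string $secret): bool
{
    return hash_equals(signPayload($body, $secret), $signature);
}

/**
 * Verify and decode a scan notification (hypothetical payload shape).
 *
 * @return array{path: string, status: string, reason: ?string}
 */
function decodeScanWebhook(string $body, string $signature, string $secret): array
{
    if (!verifyWebhook($body, $signature, $secret)) {
        throw new RuntimeException('Invalid webhook signature.');
    }

    $payload = json_decode($body, true, 16, JSON_THROW_ON_ERROR);

    if (!is_array($payload) || !isset($payload['path'], $payload['status'])) {
        throw new UnexpectedValueException('Malformed scan notification.');
    }

    return [
        'path' => (string) $payload['path'],
        'status' => (string) $payload['status'],
        'reason' => isset($payload['reason']) ? (string) $payload['reason'] : null,
    ];
}
```

The application would call decodeScanWebhook() with the raw request body and the signature header, and only then quarantine the document or notify users. Verifying the signature matters here because the endpoint has to be reachable from outside the application.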

To conclude

Scanning for malware is, in my opinion, relevant for every application we help build that deals with files, both for end users and for ourselves. No one wants to wake up and find a Bitcoin miner running on their server.

This article shows how simple it could be to set up, but I'm sure there are several ways to achieve it. Who knows, maybe there's a market for a SaaS product 💰.

Bram Devries
