How to overcome poor data quality when applying ML to early-stage venture

An exploration of why 'How do you apply ML in early-stage venture when there's not enough data?' is the wrong question.

When I talk about applying machine learning to early-stage venture capital, I’m often asked: ‘How do you do it when there’s not enough data?’ I don’t think this is the right question. The crux of the issue isn’t scarcity of data; it’s how you interrogate a complex problem space and break it down into individual learnable components.

Typically the holy grail for ML in any problem space is a complete end-to-end solution – one model carries out the whole process. But you usually can’t begin at that point. For any sufficiently complex problem, you don’t typically have the data to train such a model from the start. Further, if the problem is especially complex or ambiguous, machine learning technology might not currently be ready for the problem space.

Consider self-driving cars, for example. The end-to-end solution would translate visual inputs directly into steering and braking outputs, with one neural network in the middle. But reaching that point is a gradual process. At least when the autonomous driving field started, there was not enough data linking visual information directly to driving decisions to train such a model. The solution began by breaking the problem down into smaller, solvable tasks, tackled with a blend of conventional software and ML models. So you might start with a model doing image recognition and segmentation – “that’s a stop sign, that’s a person”. That segmentation might be passed to a neural network that assesses how much of a risk these objects pose to the car, then to another model that tags each segmented part of the image with a score, before finally going to one that decides how to turn the steering wheel in response.

This strategy of progressive problem decomposition is how we think about applying machine learning to the venture process at Moonfire.

The end-to-end solution would look something like broad company and market information in and investment decision out, but there are a few problems with that. One is that the investment decision is not one judgement, but a sequence of discrete, interconnected decisions – and, unlike self-driving, you have to wait years in VC for feedback on model performance. Another is that there is just not enough data to train a model to solve the problem end-to-end. Instead, we incrementally build towards an end-to-end solution by breaking down the investment process into steps for which we do have data and applying deep learning models to automate those steps.

We start by segmenting the investment process into stages: data-driven sourcing, screening (which we further break down into sector, stage of business, geography, and venture-scale classification), thesis-based company evaluation, founder evaluation, etc. Each of these stages is powered by one or more specialised ML models. We use a mix of supervised, unsupervised, and self-supervised learning to train these models and we draw on data from various sources depending on the exact problem we’re trying to solve. For example, we use market data repositories like Dealroom and Crunchbase, social media, product launch sites, our own internal investment theses, our past investment decisions, data we can extract from the actions that our investors take when they’re interacting with our CRM, and more.
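To make the idea of a staged process concrete, here is a minimal sketch of a screening pipeline in Python. It is an illustration, not our actual system: the `Company` fields, the stage names, and the allowed values are all hypothetical, and each simple rule stands in for what would in practice be a trained model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Company:
    # Hypothetical, simplified company record for illustration only.
    name: str
    sector: str
    stage: str
    country: str


@dataclass(frozen=True)
class Stage:
    # One step of the process; `passes` stands in for a specialised model.
    name: str
    passes: Callable[[Company], bool]


def run_pipeline(company: Company, stages: List[Stage]) -> Optional[str]:
    """Return the name of the first stage that rejects the company,
    or None if every stage passes."""
    for stage in stages:
        if not stage.passes(company):
            return stage.name
    return None


# Assumed screening criteria; real stages would wrap trained classifiers.
SCREENING = [
    Stage("sector", lambda c: c.sector in {"fintech", "health"}),
    Stage("stage", lambda c: c.stage in {"pre-seed", "seed"}),
    Stage("geography", lambda c: c.country in {"UK", "DE", "FR"}),
]

early = Company("Acme", "fintech", "seed", "UK")
late = Company("Bolt", "fintech", "series-b", "UK")
print(run_pipeline(early, SCREENING))  # None: passes every stage
print(run_pipeline(late, SCREENING))   # stage
```

Because each stage sits behind the same small interface, one step can be swapped from a hand-written rule to a learned model without touching the others, which is the practical benefit of decomposing the problem.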

Our objective isn’t to replace the human element but to augment it, accelerate it, and improve it. We want to keep our human investors and their expertise in the loop, focusing on the nuanced decisions ML cannot yet replicate. But given the sheer number of companies we see each week, they can’t make a decision on every one. So our task is to model our investors’ decision-making processes so that we can apply them to companies they don’t have time to look at, only involving them in the most complex, ambiguous decisions. By recording their deliberations and outcomes, we create a feedback loop, gathering data to refine our models and move ever closer to a more comprehensive ML solution. That’s why, whenever we hold investment committee meetings, we track why we reject companies, then use those reasons to backtest our models and feed them into our evaluation engines.
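As a rough sketch of what backtesting against recorded decisions can look like, the Python below compares a model's predictions with what investors actually decided and reports precision and recall. The `RecordedDecision` fields and the `backtest` helper are hypothetical names chosen for this example, and treating "advanced" as the positive class is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RecordedDecision:
    # Hypothetical record of one committee outcome.
    company_id: str
    advanced: bool  # True if the investors advanced the company
    reason: str     # rejection reason; empty string if advanced


def backtest(
    predictions: Dict[str, bool], decisions: List[RecordedDecision]
) -> Dict[str, float]:
    """Score model predictions against recorded investor decisions.
    Companies without a prediction are skipped."""
    tp = fp = fn = 0
    for d in decisions:
        if d.company_id not in predictions:
            continue
        predicted = predictions[d.company_id]
        if predicted and d.advanced:
            tp += 1
        elif predicted and not d.advanced:
            fp += 1
        elif not predicted and d.advanced:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall}


decisions = [
    RecordedDecision("a", True, ""),
    RecordedDecision("b", False, "not venture-scale"),
    RecordedDecision("c", True, ""),
    RecordedDecision("d", False, "outside thesis"),
]
predictions = {"a": True, "b": True, "c": False, "d": False}
print(backtest(predictions, decisions))  # {'precision': 0.5, 'recall': 0.5}
```

Keeping the rejection reason alongside each outcome means disagreements between model and investors can be grouped by reason, which points to the stage of the process that needs better data or a better model.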

It’s a continual process of questioning and re-questioning. What are we trying to predict? What data can we use for that? Can we get good data for that? Can we overcome the bad data that we have with better or different algorithms? We’re constantly rethinking how to decompose the problem, identifying what decisions we do have data for and trying to build models for automating those parts of the process. Where we don’t have data, we build tooling to allow investors to make their expert decisions in an instrumented way, such that we can extract training data from them.
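One way to picture instrumented tooling is as a log of investor actions that can later be turned into labels. The sketch below is a simplified assumption of how that might work: the `CrmEvent` shape, the action names, and the rule of keeping the latest labelled action per company are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class CrmEvent:
    # Hypothetical event emitted when an investor acts on a company.
    company_id: str
    action: str     # e.g. "advance", "reject", "view"
    timestamp: int  # assumed to be seconds since epoch

# Assumed mapping from actions to labels; other actions carry no label.
LABELS = {"advance": 1, "reject": 0}


def to_training_labels(events: Iterable[CrmEvent]) -> Dict[str, int]:
    """Derive one label per company from its most recent labelled action."""
    latest: Dict[str, Tuple[int, int]] = {}
    for e in events:
        if e.action not in LABELS:
            continue  # views and similar actions are not decisions
        previous = latest.get(e.company_id)
        if previous is None or e.timestamp >= previous[0]:
            latest[e.company_id] = (e.timestamp, LABELS[e.action])
    return {cid: label for cid, (_, label) in latest.items()}


events = [
    CrmEvent("a", "view", 1),
    CrmEvent("a", "advance", 2),
    CrmEvent("a", "reject", 3),
    CrmEvent("b", "advance", 1),
    CrmEvent("c", "view", 5),
]
print(to_training_labels(events))  # {'a': 0, 'b': 1}
```

The investor never has to fill in a labelling form; the training data falls out of the decisions they were making anyway.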

We also keep re-evaluating our approach. We regularly rework our stack to take advantage of the latest ML advancements, and we try not to let our previous understanding of the problem space over-influence the way we approach the problem. We need to keep thinking about it from first principles – what’s the most end-to-end way that we can solve this problem?

In essence, applying machine learning to venture capital is an exercise in applying ML to a complex problem space. It’s less about the data volume and more about the deconstruction of problems into machine-learnable components, and then, over time, reconstructing them into a cohesive, end-to-end process – or as near to that goal as is possible or desirable. The interplay between human intelligence and machine learning in this context is not just a technical pursuit but a fascinating exploration of human decision-making at scale. The ultimate goal is not to completely replace human judgement but to augment it, scale it, and enhance the quality of decisions in the complex, nuanced world of venture capital.