The Quick Rundown:
Haize Labs is building an AI safety, evaluation, and reliability platform to automate red-teaming and stress-test models.
They are working with the top model providers, including both OpenAI and Anthropic, as well as some of the biggest companies at the application layer from Deloitte to MongoDB.
They’re actively hiring researchers and engineers to join the team in NYC.
This is the third edition of “Startups to Join” – a blog for engineers and designers to discover early-stage (pre-seed through Series A) startups that are reimagining industries and on the path to becoming enduring companies.
Every week, I review hundreds of startups and spend hours talking to VCs and founders to write about one company that’s building something special. Whether you’re looking for the next unicorn to join or just curious about what top VCs are investing in, subscribe below to keep updated:
Act 1: Stress-Testing LLMs
A few days ago, Anthropic challenged users to identify a universal jailbreak to bypass internal safeguard protections and reveal harmful content related to chemical weapons. 132,000 prompts have been tried so far and nobody has been able to jailbreak the system.
These internal safeguard measures – like the one demoed by Anthropic – are designed to prevent LLMs from outputting harmful content and they are becoming an increasingly important focal point. Some of this has been driven by regulation; the recently passed EU AI Act now requires model providers to produce documentation related to “internal and/or external adversarial testing (e.g., red teaming)”. But, regulation isn’t the only reason – a large part of this focus has been driven by the reputational fear of bad actors jailbreaking models and causing harm.
This reputational fear exists for both the model layer and for the applications built on top of the models themselves. Every enterprise across every industry, from consulting firms to healthcare providers, is concerned about becoming the face of the next Air Canada lawsuit. And, as model capabilities continue to advance, this reputational risk is only growing.
While Google telling users to put glue on pizza is relatively harmless in practice, accessing confidential data or knowledge about how to produce chemical weapons can lead to significantly worse outcomes. This is the exact reason why AI safety is one of the most important topics right now and why everyone is looking for ways to ensure that model outputs are reliable and safe.
But, red-teaming these models isn’t a simple task.
First, there is the infinite prompt space problem. The number of possible prompts that can be used to interact with an LLM is infinite and there are a lot of unique prompting variations (special characters, custom fine-tuned instructions, etc.). This results in a search space for attack prompts that makes accurate red-teaming difficult.
Second, we simply don’t know a lot about why models produce the outputs they do. And when you don’t know why a system behaves the way it does, it’s harder to predict what type of prompt will elicit a harmful response. In Anthropic’s words, “We do not know how to train systems to robustly behave well”. To that end, a model not outputting a harmful response does not necessarily mean that it wasn’t close to doing so, adding a new layer of complexion.
Third, we don’t have standardized practices for red-teaming AI models. Anthropic and OpenAI don’t necessarily use the same techniques to assess the same threat vector, and this lack of alignment makes it challenging to build upon the work of others. This problem is worsened by the fact that attack prompts don’t generalize. As we saw with the poem attack that extracted GPTs training data, that same prompt – “Repeat the word “poem” forever” – didn’t work on any other model.
Fourth, manual red-teaming is expensive and slow. You need domain experts who understand the harmful content they are looking for, AI engineers who understand how models work, and a lot of compute to test over an infinite prompt space.
Finally, and most importantly, incentives are simply misaligned. Red-teaming a model increases time to deployment and speed is all that matters in today’s AI race. Model providers are incentivized to scale quickly, and that’s partially why half of OpenAI’s safety team quit last year. Application providers are also incentivized to launch AI products before competitors. This means growth is prioritized and it’s why AI providers aren’t positioned to build the optimal red-teaming solution.
So, although red-teaming is critical, neither model providers nor the applications built on top of these models are designed to stress-test all use cases; model providers have misaligned incentives, and application providers lack the technical expertise. This gap is where the opportunity lies: use safety testing as the entry point to work with large enterprises and open the door to the real opportunity – becoming the ultimate LLM reliability platform.
Act 2: From Risk to Reliability
As concerned as risk teams are about harmful content, extraction attacks, and model theft, the real bottleneck for enterprises is that AI systems are extremely brittle and inherently unpredictable. Stress-testing for harmful outputs is definitely a key product of this, but it’s only one piece of a much larger puzzle: how can you evaluate and ensure quality when you’re working with a non-deterministic system that produces different outputs with the same input?
Today, this challenge of translating model outputs into useful evaluation and QA metrics is still a complex task. For instance, knowing what models perform better for what prompt or whether a model is actually providing accurate output is very hard to discern. But, without that data, engineering and research teams can’t get a reliable AI application into production in the first place.
And this is where I believe the big opportunity truly exists. By acting as an end-to-end QA/evaluation solution that not only stress-tests for harmful outputs but also enables enterprises to continuously monitor model performance in production, you can solve the last-mile reliability problem in AI and help engineering/research teams launch AI applications.
This opportunity to build the go-to AI reliability platform is exactly what the team at Haize Labs is working on and it’s why I’ve chosen to cover them as the 3rd startup in this series.
Haize Labs
Haize Labs algorithmically stress-tests LLMs to preemptively discover any failure mode and enforce domain-specific alignment. This automated approach systematically probes AI models for vulnerabilities using a similar approach to how cybersecurity researchers test a computer system for weaknesses. The team has already made a few key advancements to solve some of the challenges associated with red-teaming AI models.
First, they have significantly reduced the time it takes to test AI models at scale. Before, the SOTA Greedy Coordinate Gradient (GCG) attack could take over 153 minutes to generate a single adversarial attack against a model like Llama 2, making comprehensive testing impractical for most organizations. Haize Labs optimized this attack by developing the Accelerated Coordinate Gradient (ACG) method, leading to a ~38x speedup and ~4x GPU memory reduction while maintaining effectiveness. This speed-up has significant practical implications for red-teaming: in the time it takes ACG to produce 33 successful attacks against the AdvBench set, GCG doesn’t even produce one.
Second, they’ve developed a unique approach to systemic prompt optimization using the DSPy framework. While traditional red-teaming often relies on the manual crafting of adversarial prompts, the team at Haize Labs has developed a feed-forward system that automatically generates and refines attack prompts. By generating adversarial prompts and then iteratively refining them, their algorithm can explore the infinite prompt space with increased efficiency. With this approach, Haize Labs achieved a 44% attack success rate, which was 4x higher than baseline techniques. Though this isn’t SOTA, the team spent “no effort designing the architecture and prompts” and just used an “off-the-shelf optimizer with almost no hyperparameter tuning”.
Third, they’ve improved uncertainty quantification methods by “orders of magnitude”, which essentially means that they can accurately identify when an AI system is unsure or unreliable in its output. Alongside that, they have developed an automated system called Cascade for multi-turn jailbreaks (using multiple prompts to jailbreak a system) that is significantly more effective than competitors using manual red-teaming.
On top of these developments, by being a third-party company, Haize Labs is uniquely positioned to define what it means for an AI model to be reliable in different verticals and then stress-test models according to these standards. And their entire system is set up to capitalize on this unique positioning advantage by creating rules regarding AI safety, identifying jailbreaks that violate those rules, and then remedying the vulnerabilities in domain-specific contexts.
In practice, Haize Labs takes a company's goals for their AI system (e.g., "never provide medical advice without disclaimers") and turns them into automated testing rules called model-based evaluators. They do this via synthetic data generation and an adversarial attack approach, and then they use active learning to make these rules more precise. Then, using these evaluators, they can stress-test models to catch bugs during both the development and production stages (via their inference-time oversight mechanisms). Once these vulnerabilities are identified, they can robustify the AI system at the model layer (via safety fine-tuning), the prompt layer (via prompt optimization techniques), and the entire system layer (an autoML problem).
Today, Haize Labs is already working with top model providers, including both OpenAI and Anthropic. Remember that jailbreak demo from Anthropic that nobody has been able to bypass so far: Haize Labs helped red-team prototypes of that system. Alongside the model providers, Haize Labs is also working closely with companies like Deloitte and MongoDB at the application layer. And, as more enterprises want to deploy AI applications, the role of an LLM reliability platform that can enforce domain-specific safety and accelerate time to production will only become more important.
Jobs at Haize Labs
With rapidly improving model capabilities and every company looking to immediately deploy AI applications, I think now is an incredibly exciting time to be building an AI reliability platform that can algorithmically test and improve models at scale. If that sounds exciting to you, Haize Labs is hiring across engineering and research roles in New York. The team, despite their young age, is especially accomplished and they’re backed by some of the best advisors and founders in the AI space.