I love cutting-edge tech, but I hate hyperbole, so I find AI to be a real paradox. Somewhere in that whole mess of overnight influencers, disinformation and ludicrous claims is some real "gold" - AI stuff that's genuinely useful and makes a meaningful difference. This blog post cuts straight to the good stuff, specifically how you can use AI with Have I Been Pwned to do some pretty cool things. I'll be showing examples based on OpenClaw running on the Mac Mini in the hero shot, but they're applicable to other agents that turn HIBP's data into more insightful analysis.
So, let me talk about what you can do right now, what we're working on and what you'll be able to do in the future.
Model Context Protocol (MCP)
A quick MCP primer first: Anthropic came up with the idea of building a protocol that could connect systems to AI apps, and thus the Model Context Protocol was born:
Using MCP, AI applications like Claude or ChatGPT can connect to data sources (e.g. local files, databases), tools (e.g. search engines, calculators) and workflows (e.g. specialized prompts)—enabling them to access key information and perform tasks.
If I'm honest, I'm a bit on the fence as to how useful this really is (and I'm not alone), but creating it was a no-brainer, so we now have an MCP server for HIBP:
https://haveibeenpwned.com/mcp
You can't just make an HTTP GET to the endpoint, but you can ask your favourite AI tool to explain what it does:
In other words, all the stuff we describe in the API docs 🙂 That's an overly simplistic statement, and there are many nuances MCP introduces beyond a computer reading docs intended for humans, but the point is that we've implemented MCP and it's there if you want it. Which means you can easily use the JSON below to, for example, extend GitHub Copilot:
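As a sketch of what that JSON looks like, here's the general shape of an MCP server entry as used by VS Code and Copilot (the server name "hibp" is just a label I've chosen; check your tool's docs for the exact schema and file location):

```json
{
  "servers": {
    "hibp": {
      "type": "http",
      "url": "https://haveibeenpwned.com/mcp"
    }
  }
}
```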
This is really the point of the whole thing - how can humans use it to do genuinely useful stuff? In particular, how can they use it to do stuff that was hard to do before, and how can "normies" (non-technical folks) use it to do stuff they previously needed developers for? I've been toying with these questions for a while now. Here's what I've come up with:
Firstly, I'm going to do all these demos on OpenClaw. I've been talking a lot about that on my weekly live streams over the past month, and the "agentic" nature of it (being able to act as an independent agent tying together multiple otherwise independent acts) is enormously powerful. Every company worth its AI salt is now focusing on building out agentic AI, so whilst I'm using OpenClaw for these demos, you'll be able to do exactly the same thing in your platform of choice, either now or in the very near future.
I'm using a Telegram bot as my interface into OpenClaw, so let's kick it off:
Easy, right? 🙂 There's a different discussion around how secrets are stored and protected, but that's a story for another time (and is also obviously dependent on your agent). But the key is easily rotated on the HIBP dashboard anyway. If you don't have a key already, go and take out a subscription (they start at a few bucks a month), and you'll be up and running in no time.
Now that I know I'm connected, let's learn about how I'm presently using the service:
Most of these are pretty obvious, but I've also included another use case here that I use to monitor how the service is behaving with a large organisation. It's a real domain with real data, so I'm going to obfuscate it to preserve privacy, but it's a great demonstration of how useful AI is. In fact, the inspiration for this blog post was when I received this notification last week:
One of the most common questions after someone in a large org receives an email like this is "who are those 16 people in the breach?" Because we can't reliably filter large domains in the UI, I'd normally suggest they either download the CSV or JSON format in the dashboard and search for "Hallmark" in there, or use the API and write some code. But now, there's a much easier way:
Well that was easy 😎 I like the additional context too, and now it has me curious: what have these people been up to?
Because I'm on a Pro plan, I've also got access to stealer logs (as do you, if you're still on the old Pwned 5 plan). Let's see what's going on there:
If you were running an online service, that first number would indicate compromised customers. But as OpenClaw has suggested here, the second number is the one that's interesting in terms of employees entering their data into other websites using the corporate email address. But they'd never reuse the same password as the work one, right? 🤔 Best check which services they're entering organisational assets into:
The first one makes sense and is extra worrying when you consider these are people infected with infostealers. That's not necessarily malware on a corporate asset; they could always be using an infected personal device to sign into a corporate asset... ok, that's also pretty bad! I was a bit surprised to see Steam in there TBH - who's using their corporate email address to sign into a gaming platform?! A quiet chat with them might be in order. And the bamboozled.net stuff is weird, I want to understand a bit more about that:
Now I'm losing interest in this blog post and am really curious as to what's actually in the data!
Ok, so there's an entire rabbit hole over there! Let's park that, but think about how useful information like this is to infosec teams when you can pull it so easily. Or how useful info like this is to HR teams 😬
Keep in mind, these are corporate addresses tied to the company and are the company's property, so, yeah...
But remember the agentic nature of OpenClaw means we can ask it to go off and run tasks in the background, tasks like this:
This was just a little thought experiment I set up a few days ago and forgot about until yesterday, when I loaded a new breach:
I never asked it to look for "functional/system accounts"; it just decided that was relevant. And it is - this breach clearly had a lot of data in it related to purchases of services, which is an interesting aspect.
The idea of running stuff on a schedule opens up a whole raft of new opportunities. For example, monitoring your family's email addresses: "let me know when mum@example.com appears in a new breach". From here, your creativity is the only limit (and even that statement is debatable, given how much stuff AI agents come up with on their own). For example, creating visualisations of the data:
I could go on and on (I started going down another rabbit hole of having it generate executive-level reports with all the data), but you get the idea.
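Even without an agent, the "let me know when mum@example.com appears in a new breach" idea from above is only a few lines of script against the API. Here's a minimal sketch: the `breachedaccount` endpoint, the `hibp-api-key` header and the 404-means-no-breaches behaviour are all per the HIBP v3 API, while the diff logic and the scheduling (cron, your agent, whatever) are left up to you:

```python
import json
import urllib.error
import urllib.request

API = "https://haveibeenpwned.com/api/v3/breachedaccount/{}"

def fetch_breach_names(email: str, api_key: str) -> set[str]:
    """Names of the breaches an address appears in (truncated response)."""
    req = urllib.request.Request(
        API.format(email),
        headers={"hibp-api-key": api_key, "user-agent": "family-monitor"},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return {b["Name"] for b in json.load(resp)}
    except urllib.error.HTTPError as e:
        if e.code == 404:  # 404 simply means "not in any breach"
            return set()
        raise

def new_breaches(previous: set[str], current: set[str]) -> set[str]:
    """Anything in the current result that wasn't there last time."""
    return current - previous
```

Persist the previous result between runs, compare on each check, and fire off a notification whenever `new_breaches` comes back non-empty.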
The AI Pipeline
This is about what's in our pipeline, and the primary theme is putting tooling where it's more easily accessible to the masses. Creating a connector in Claude, an app in ChatGPT, and similar plumbing in the other big players' AI tools is an obvious next step. This will likely involve adding an OAuth layer to HIBP, allowing end users to configure the respective tools to query those HIBP APIs under their identity and achieve the same results as above, but built into the "traditional" AI tooling in a way people are familiar with.
Future
A big part of this is about AI enabling more human conversations to achieve technical outcomes. I spotted this from Cloudflare just yesterday, and it's a perfect example of just this:
Cloudflare dashboard can now complete tasks for you.
- "Create a Worker and bind a new R2 bucket to it"
- "Change my DNS records to 1.1.1.1"
- "How many errors have happened this week"
Not only do we tell you, but we show you with generative UI.
I've been pretty blown away by both how easy this process has been and how much insight I've been able to draw from data I've been sitting on for ages. We'll be building out more tooling and easily reproducible demos in the future, and I'm sure a lot of that will do stuff we haven't even thought of yet. If you give this a go and find other awesome use cases, please leave a comment and tell me what you've done, especially if you've cut through the hyperbole and created some genuinely awesome stuff 😎
For a hobby project built in my spare time to provide a simple community service, Have I Been Pwned sure has, well, "escalated". Today, we support hundreds of thousands of website visitors each day, tens of millions of API queries, and hundreds of millions of password searches. We're processing billions of compromised records each year provided by breached companies, white hat researchers, hackers and law enforcement agencies. And it's used by every conceivable demographic: infosec pros, "mums and dads", customer support services, and, according to the data, more than half the Fortune 500 who are actively monitoring the exposure of their domains. So yeah, "escalated" seems fair!
Amidst all the time spent processing data, we've been trying to figure out where to invest energy in building new stuff. In essence, data breaches are pretty simple: you've got a bunch of exposed email addresses attributed to a source, sitting next to a whole bunch of fields we describe with metadata. Our goal has always been to help people use this data to do good after bad things happen, and today we're launching a bunch of new features to do just that. So, here goes:
New Features, New Plans
In the beginning (ok, in "recent years"), there was one plan we referred to as "Pwned", and within that, there were various levels. For example, the entry-level plan has been "Pwned 1," and to this day, more than half our subscriptions are on it. That's "a coffee a month" for a simple service that, by the raw numbers, does precisely what most of our subscribers are looking for. These are typically small businesses that make a handful of API queries or monitor a domain or two with a few email addresses. It's simple, effective and... insufficient for larger organisations. So, we added Pwned 2, 3 and 4, and they all added more RPMs for email searches and more capacity for searching larger domains. Then we added Pwned 5, which added stealer log support, and somewhere along the way also added Pwned Ultra tiers for making large numbers of API requests. As a result, that one "plan" added more and more stuff at different levels and ultimately became a bit kludgy.
Today, we're launching a bunch of new features to better support the volume and privacy needs of our subscribers, and we're shuffling our existing plans to help do this. Here's what they now look like:
Core: The fundamentals, largely being what we already had and designed for entry-level use cases
Pro: Contains a bunch of the new features designed for larger orgs and those searching domains on behalf of customers
High RPM: The old "Ultra" plan levels, designed solely for making large volumes of requests to the email search API
Enterprise: We've had this for many years now, and it's a more tailored offering
So, that's the high-level overview. Let's now look at all the new stuff and everything that changes:
Supporting MSPs Monitoring on Behalf of Third Parties
For most people, this won't sound particularly exciting, but I'm putting it up front because I'll refer to it when describing the more important stuff shortly. In the past, we've had the following carve-out in our terms of use, namely, what you're not permitted to use the service for:
the benefit of a third party (including for use by a related entity or for the purpose of reselling or otherwise making the Services available to any third party for commercial benefit)
This excluded managed service providers from, for example, monitoring their customers' domains as part of their services. That clause has now been revised by prefixing it with the following text:
unless you have purchased a Paid Service which expressly allows you to do so
Which means we can now welcome MSPs to the Pro and High RPM tiers. They can't just take HIBP and use it to create a competing product (for obvious reasons, that's a pretty standard clause within many online services), but they can absolutely add it to the offerings they provide to their own customers. And we're adding new features to make it easier to do just that, for example:
Automating Domain Verification
Preserving privacy whilst still providing a practical, effective service has always been a balancing act, one I think we've gotten pretty spot on. But the hoops people have had to jump through for domain verification, in particular, have been cumbersome. An organisation wanting to add a bunch of its domains has had to go through the process one by one via the web interface, then verify control over them one by one. They'd spend a lot of time doing kludgy, repetitive work. Today, we're launching two new ways of adding domains in a much more automated fashion, and the first is verification via the DNS API:
Successfully adding a pre-defined TXT record to DNS is solid proof that whoever is attempting to search that domain genuinely controls it. Alongside the old kludgy way of doing it in the browser, waiting for DNS to propagate, then coming back to the browser to complete the verification, we can now fully automate the process via the API. Here's how it works:
Call the HIBP API to generate the TXT record token
Call the API on your DNS provider to add the token to the TXT record
Call another HIBP API to validate that the token exists
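The three steps above might be scripted along these lines. To be clear, this is a sketch, not gospel: the endpoint paths, the response field names and the `add_txt_record` callback are all placeholders I've made up for illustration, so check the actual API docs for the real ones:

```python
import json
import time
import urllib.request

BASE = "https://haveibeenpwned.com/api/v3"
HEADERS = {"hibp-api-key": "[your key]", "user-agent": "domain-verifier"}

def call(method: str, path: str) -> dict:
    """Minimal authenticated call to the HIBP API."""
    req = urllib.request.Request(BASE + path, method=method, headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def retry(check, attempts=10, delay=30, sleep=time.sleep):
    """Poll check() until it succeeds or we run out of attempts."""
    for _ in range(attempts):
        if check():
            return True
        sleep(delay)
    return False

def verify_domain(domain: str, add_txt_record) -> bool:
    # 1. Ask HIBP to generate the TXT record token
    #    (path and response field are placeholders; see the API docs)
    token = call("POST", f"/domainverification/{domain}/txt")["Token"]
    # 2. Push the token to DNS via your provider's API
    add_txt_record(domain, token)
    # 3. Ask HIBP to validate the token, retrying while DNS propagates
    return retry(lambda: bool(
        call("POST", f"/domainverification/{domain}/verify").get("Verified")))
```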
This is easily scripted in your language of choice, and you can enumerate it over as many domains as you like. You can also keep retrying step 3 above as often as needed when DNS takes a little while to do its thing. It's all now fully documented in the latest version of the API and ready to roll. But what if you don't control the DNS? Perhaps it's a cumbersome process in your org, or you're an MSP monitoring your customers' domains without control of their DNS. That's where the verify-by-email API comes in:
We've long had a verification process that involves choosing one of several standard aliases on a domain to email a verification token to. You do this via the dashboard, grab the token sent to the email, paste it back into the dashboard and the domain is now verified. The new API makes that much easier, especially when multiple domains are being verified. Here's how it works:
Call the HIBP API and specify one of the pre-defined aliases to send a verification email to
Click the link in the email and approve the domain to be added to the requester's account
And that's it. We see this being particularly useful for MSPs who can now send a heap of emails on their customers' domains, and so long as someone receives it and clicks the link, that's the verification process done. That API is also now fully documented and ready to roll and is accessible to all Pro plan subscribers.
Auto-verifying Subdomains
This one was just unnecessarily frustrating for larger customers who spread email addresses over multiple subdomains. Let's say a company owns example.com and they successfully verify control of it, but then they distribute their email addresses by region. They end up with addresses @apac.example.com and @emea.example.com and so on, and in the past, needed to verify each subdomain separately.
Turns out we have 154 votes for this feature on UserVoice, which is substantially more than I expected. So, in keeping with the theme of the Pro plan making it easier on larger orgs, anyone on that level can now add their apex domain, verify it accordingly, then go to town adding all the subdomains they want without needing to verify each one.
Bringing K-Anonymity Searches to the Masses
Until today, every time you took out a subscription via the public website and started searching email addresses, it looked like this:
GET https://haveibeenpwned.com/api/v3/breachedaccount/test@example.com
Clearly, this involved sending the email address to HIBP's service. Whilst we don't store those addresses, if you're sending data to a service in this fashion, there's always the technical capability for us to see that piece of PII and associate it back to the requester via their API key. This approach is what we'll refer to as "direct email search". Let's now look at k-anonymity searches, and I'll break it down into a few simple steps:
Start by creating a SHA-1 hash of the address to be searched, so for test@example.com, that's:
567159D622FFBB50B11B0EFD307BE358624A26EE
Take the first 6 characters of the hash and pass them to the new API:
GET https://haveibeenpwned.com/api/v3/breachedaccount/range/567159
The prefix presently contains 393 suffixes, and if one of them matches the remaining characters of the hash of the full email address, you know that's the address you're looking for.
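The hashing and splitting side of this is a few lines of Python (the actual HTTP request and the comparison against the returned suffixes are left out of this sketch):

```python
import hashlib

def kanon_parts(email: str) -> tuple[str, str]:
    """Split an address into the 6-char prefix sent to HIBP and the
    suffix that's compared locally against the returned candidates."""
    digest = hashlib.sha1(email.encode("utf-8")).hexdigest().upper()
    return digest[:6], digest[6:]

prefix, suffix = kanon_parts("test@example.com")
url = f"https://haveibeenpwned.com/api/v3/breachedaccount/range/{prefix}"
# Only `prefix` ever leaves your machine; you then check whether
# `suffix` appears among the suffixes in the API response.
```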
This is the same methodology we've been using for years with the Pwned Passwords search, and we're currently serving about 18 billion requests a month, so it seems that lots of people have easily gotten to grips with it. It's a pretty simple technical concept with great privacy attributes, and it's fully documented on the API page.
K-anonymity searches are now available to all Pro and High RPM subscribers at the same rate limit as the direct searches. That rate limit is shared, so you can send 100% of your requests via k-anon, 100% via direct search, or any mix in between. We're really happy with the privacy aspects of this API, and we know it ticks a box a lot of orgs have been asking for.
Unsmoothing the API Rate Limit
Previously, when you took out a 10-request-per-minute API key, we implemented a rate limit of 1 request every 6 seconds. The same logic applied to all the higher-tier products, too, and the reason was simply to distribute the load across each minute more evenly, or in other words, to "smooth" the rate at which requests were made. That was important earlier on, as the underlying Azure infrastructure had to support that traffic, and sudden bursts could be problematic.
But the other thing that was problematic is that people (quite reasonably) assumed that they could make 10 fast requests, wait a minute, then go again. This led to support overhead for us and customer frustration, and neither is good.
With these latest updates, 10RPM (and all the other RPMs) is now implemented exactly as it sounds - 10 requests in any one-minute block. Here's our Azure API Management policy:
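The policy is along these lines, sketched using APIM's built-in `rate-limit-by-key` element; the production policy will differ in its details:

```xml
<inbound>
  <base />
  <!-- 10 calls per 60-second renewal period, counted per subscription key -->
  <rate-limit-by-key calls="10"
                     renewal-period="60"
                     counter-key="@(context.Subscription.Id)" />
</inbound>
```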
In other words, we've "unsmoothed" it. You can hammer the service 10 times in quick succession, then wait a minute, and you won't see a single HTTP 429 "Too many requests" response. Equally, if you're on a 12,000 RPM plan (and you can actually send that many requests quickly!), you won't see an unexpected 429. We can do this now because of the way we serve a huge amount of content from Cloudflare's edge, unburdening the underlying infrastructure from sudden spikes.
It's a little thing, but it'll solve a lot of unnecessary frustration for a bunch of people, including us. That's implemented across every single plan, too, so everyone benefits.
We Just Wanna Go (Even) Fast(er)
Here's our challenge today: how do we enable millions of people a day to search through billions of records with near instantaneous results... and do it affordably? They're somewhat competing objectives, but every now and then, we find this one neat trick that dramatically improves things. About 18 months ago, I wrote about how we were Hyperscaling HIBP with Cloudflare Workers and Caching. The basic premise is that, as people search the service, we build a cache in Cloudflare's 300+ edge nodes that includes the entire hash range just searched for (see the k-anon section above). We flush that out on every new breach load and as it builds back up to the full 16^6 possible cacheable hash ranges, our origin load approaches zero and everything gets served from the edge. Almost, because we have the following problem I described in the post:
However, the second two models (the public and enterprise APIs) have the added burden of validating the API key against Azure API Management (APIM), and the only place that exists is in the West US origin service. What this means for those endpoints is that before we can return search results from a location that may be just a short jet ski ride away, we need to go all the way to the other side of the world to validate the key and ensure the request is within the rate limit.
Or at least we had that problem, which we've just solved with a simple fix. The quoted problem stemmed from the fact that, to ensure everyone adhered to the rate limit, we performed the APIM check before returning any data. That meant always waiting for packets to make a round trip to America, even when the data was cached nearby. But what we realised is that the rate limit can be enforced eventually; it really doesn't matter too much if a request or two in excess of the limit slips through before we clamp down. The reason that epiphany is important is that, with it in mind, we can start returning data to the client immediately whilst doing the APIM check asynchronously. If the request exceeds the rate limit, Cloudflare will block subsequent requests until the client starts making requests within their limit. So, the rate limit check is no longer a blocking call; it's a background process that doesn't delay returning results.
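The pattern is easy to illustrate. Here's a conceptual sketch in Python's asyncio; the real thing runs at the edge rather than in Python, where a Workers-style `waitUntil` keeps the background check alive after the response has gone out:

```python
import asyncio

async def handle_request(fetch_cached, check_rate_limit):
    """Return cached data immediately; enforce the rate limit eventually."""
    # Start the slow, cross-ocean rate-limit check in the background...
    background = asyncio.create_task(check_rate_limit())
    # ...and serve the edge-cached result without waiting for it.
    data = await fetch_cached()
    # An over-limit caller gets blocked on *subsequent* requests,
    # not this one, so the check never delays the response.
    return data, background
```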
What that means is a dramatic reduction in the time to first byte:
That's almost a 40% reduction in wait time! It's an awesome example of how continuous investment in the way we run this thing yields tangible results that make a meaningful difference to the value people get from the service.
Passkeys!
Just one more thing...
This is all new, all free and all available to everyone, whether they have a paid subscription or not. Remember when I got phished last year? I sure do, and I vowed to use that experience to maximise the adoption of passkeys wherever possible. So, putting my money (and time) where my mouth is, we've now launched passkeys as an alternate means of signing into your dashboard:
This saves you needing a "magic" link via email on every sign-in, and whilst it doesn't constitute 2FA (the passkey becomes a single factor used to sign in), it massively streamlines how you access the dashboard. And because we never used passwords for access in the first place, the only account-takeover risk our customers face is someone gaining access to either their email account or to where they store their passkeys (in either case, they have much bigger problems!).
Here's how it works: start by signing into your dashboard, then heading over to the "Passkeys" section on the left of the screen and adding a new one:
The name is so you can keep track of which passkey you save where. I save most of mine in 1Password, but you can also save them on a physical U2F key or in your browser, for example. Clicking "Continue" will cause your browser to prompt you for the location where you'd like to store it and again, that's 1Password for me:
And that's it - we're done!
So, how does it work? Check this out, and don't blink or you'll miss it:
Compared to typing in your email address, hitting the "Sign In" button, flicking over to the mailbox, waiting for the mail to arrive, then clicking the link, we're down from let's call it 30 seconds to about 3 seconds. Nice 😎
Even though there isn't much security benefit to doing this on HIBP (you can still sign in via email, too), we wanted to build this as an example of just how easy it is. It took Stefán about an hour to build a first cut of this (with support from Copilot), and, aside from the dev time, building passkey support into your website is totally free. There are no external services you need to pay for, no hardware to buy or special crypto concepts to grasp. Passkeys are dead simple, and web developers with even a passing interest in security and usability should be adding support for them right now. We also wanted to make sure they were freely available to anyone, regardless of whether you have a paid subscription, because security like this should be the baseline, not a paid extra. So, go and give them a go in HIBP now.
One immediate difference to how we've previously represented the plans is that the annual price is now shown as a monthly figure. It turns out that the vast majority of our subscribers choose annual billing, so leading with the month-by-month pricing put the least relevant figure front and centre. As we looked around at other services, presenting it this way was a pretty consistent trend, especially as an annual subscription is more cost-effective than renewing a monthly one 12 times (a year paid annually works out at roughly 10 monthly payments).
Another change is that we're going to cap the number of larger domains (those with over 10 breached addresses) that can be searched on each subscription. Let me explain why: Every time we load a data breach, each record in the breach is checked against each domain being monitored. In 2025, we added 2.9 billion breached records, and we have 400k monitored domains. Multiply those out, and we're looking at 1.16 quadrillion checks for our subscribers each year. This is all handled by SQL queries, so it's not like we're getting hit with human overhead at scale, but we're getting hit hard with SQL costs. Across everything we pay to run this service (storage, app hosting, functions, API management, App Insights, bandwidth, etc.), the SQL bill is more than the total for all other services combined. In addition to how we currently calculate plan size based on breached email count, we're adding a cap on the number of domains per plan.
Only domains with more than 10 breached addresses are included in the cap.
The “10” threshold aligns with the existing requirement for a domain to need a subscription at all, and means this change impacts only a single-digit percentage of subscribers. It also helps filter out noise so the cap reflects domains that actually matter. For subscribers with larger domains beyond the cap, all current alerts will continue to work just fine until they next run a search, at which point they'll have the option to upgrade the plan or reduce the number of domains. But none of that affects existing subscribers now:
There will be no changes to existing plans until at least August 2 this year.
We do an annual price revision each August, and that's already factored into the table above. That applies to any new subscriptions immediately, but it won't touch existing ones until August 2 at the earliest. The revised pricing only kicks in on the next subscription renewal after that date, so it could be as late as August 2027 if you're an existing subscriber. The same goes for the cap on the number of domains being monitored - there's no impact on existing subscribers until at least August. That leaves plenty of time to cancel, downgrade, upgrade, or just do nothing, and the plan will automatically roll over to the new one. We'll be emailing everyone in the coming days with details of precisely what will change.
Note: if you had an old Pwned 5 subscription for the sake of stealer log access, we'll be rolling all those folks over to Pro 1 and applying a permanent discount code to ensure there's no change in price by moving to the higher plan (it'll actually drop slightly). That'll be explained in the upcoming email; it just made more sense to keep stealer logs in Pro and move people over, and this'll give them free access to all the new stuff too.
Speaking of which, the thing that (almost) nobody reads but everyone is subject to has been revised to reflect the changes described above - the Terms of Use. For the first time, we've also summarised all the changes and linked through to an archive of the old ones, so if you really love digging through a long document prepared by lawyers, this should make you happy 😊
We're Still Doing Credit Cards via Stripe
While I'm here, just a quick comment on our ongoing Stripe dependency and, as a result, the necessity to pay for public services via credit card. I've written before about some of the challenges we've faced with customers' requests to pay by other means and how, when push comes to shove, they (almost) always find a way around internal barriers. Let me share a recent anecdote about this:
Just the other day, I had a call with a Fortune 500 company that was initially interested in our enterprise services. As the discussion unfolded, it became evident that the public services would more than suffice and that the enterprise route was too burdensome for their particular use case. Be that as it may, the procurement lady on the call was adamant that payment by credit card was impossible, even going to the extent of making a pretty bold statement:
No Fortune 500 company is going to pay for services like this via credit card!
O RLY? If only I had the data to check that claim... 😊 Based on a list of their domains, 132 unique Fortune 500 companies have paid for our services by credit card. The real number will be higher because many more of their domains are not on that list, or purchases have been made via an email address not on the corporate domain. Let's call it somewhere between a quarter and a third of the Fortune 500 who've purchased directly via the world's most common payment method. In other words, a significantly different number from the "zero" claim.
I've dropped the hard facts here both out of frustration with the unnecessarily artificial barriers we keep running into and in support of the folks out there who, just like me in my corporate days, have to deal with "Neville" in procurement. Per that linked blog post, push back against "corporate policy" prohibiting payment by card, and statistically, you'll likely find you're not the 1 in 160 who can't make a simple payment.
Summary
We're continuing to massively invest in expanding HIBP in every way we can find. Nearly 3 billion additional breached records last year, hundreds of billions of free Pwned Passwords queries during that time, a bunch of new tweaks and features everyone gets access to and, of course, all the new stuff we've rolled into the higher plans. These new features are the culmination of a huge volume of work dating back to November, when I took this pic of our little team during our planning meeting together in Oslo.
We all hope it helps people use our Have I Been Pwned services to do more good after bad things happen.
Remember the Ashley Madison data breach? That was now more than a decade ago, yet it arguably remains the single most noteworthy data breach of all time. There are many reasons for this accolade, but chief among them is that by virtue of the site being expressly designed to facilitate extramarital affairs, there was massive social stigma attached to it. As a result, we saw some pretty crazy stuff:
Arguably, we now live in a more privacy-conscious era, one full of acronyms such as GDPR and CCPA, among others, in different parts of the world. The right to be forgotten, the right to erasure, and, indeed, privacy as a fundamental human right feature very differently in 2026 than they did in 2015. But arguably, even back then, the impact of outing someone as a member of the site should have been obvious. It was certainly obvious to me, which is why I introduced the concept of a sensitive data breach before the data even went public. HIBP wouldn’t show results for this breach publicly because I was concerned about the impact on people being outed. My worst fear was a spouse coming home to find someone having taken their own life, an HIBP search result on the screen in front of their lifeless body.
People died as a result of the breach. Marriages ended and lives were turned upside down. People lost their jobs. The human toll of the breach was profound. The decision I made after witnessing this was that if a breach was likely to have serious personal or social consequences for people in there, it would be flagged as sensitive and not publicly searchable.
The public doxing of members of the service was often justified on a moral basis: “adultery is bad, they deserve to be outed”. But there are two massive problems with this attitude, and I’ll begin with the purpose for which accounts were sometimes made:
An email address appearing in that breach implied that the person was there to have an extramarital affair because that was literally the catch-phrase of the service: “Life is short, have an affair”. But the reality was that people were members of the service for many, many different reasons. Have a read of my post titled Here’s What Ashley Madison Members Have Told Me and you’ll begin to understand how much more nuanced the situation was:
Single people had joined the service, and later married before the breach occurred
People who were worried about a cheating spouse joined the service in order to try to catch them
Accounts were made with some people’s names and email addresses without their consent (there are many “Barack Obamas” in the data)
So, should everyone with an email address on Ashley Madison be considered an adulterer? Clearly, no, that completely misses the nuances of what an email address in a data breach really means. But what about the people who were there to have an affair? Well, that brings us to the second problem:
Our own personal belief systems are not a valid basis for outing people publicly because their belief systems differ. I used more generic terms than “extramarital affair” or “cheating” because there are many other data breaches that are flagged as sensitive in HIBP for the very same reason. Fur Affinity, for example: there is a social stigma around furries and outing someone as a member of that community could have negative consequences for them. Rosebutt Board is another example: anal fisting is evidently something a bunch of people are into, and equally, I’m sure there are many who take a moral objection to it. And finally, to get to the catalyst for this post, WhiteDate: the website that is ostensibly designed for white people to date other white people. Flagging that as sensitive resulted in some unsavoury commentary being directed at me:
Now, I emphasised “ostensibly” because the more you dig into this breach, the more you find tones of white supremacy and other behaviours that definitely don’t align with my personal value system. That societal view doesn’t sit well with me, and I think I’m safe in saying it wouldn’t sit well with most people. Would someone being outed as a member of that service be likely to result in “serious personal or social consequences”? Yes, and you can see that in the messaging from the same account:
Context matters. U are literally shielding Nazi hate mongering scoundrels. We can't doxx white supremacists?
If ISIS had a dating site & it got breached, would you protect it out of fear of doxxing? No.
This behaviour is precisely what I don’t want HIBP being used for: as a weapon to attack people solely on the basis of their email address being affiliated with a website that has had a data breach. It's also worth remembering that GDPR explicitly defines "special categories" of personal data warranting extra protection:
personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs;
trade-union membership;
genetic data, biometric data processed solely to identify a human being;
health-related data;
data concerning a person’s sex life or sexual orientation.
An ISIS dating website breach would tick many of the boxes above and would therefore constitute a sensitive data breach. That's not an endorsement of what they stand for; it's simply a data-processing decision. But there may be a nuance in there which I didn't see present in the WhiteDate data - what if it contained illegal activity? (Sidenote: for the most part, HIBP is used by people in Western Europe, North America and Australasia, so when I say "illegal", I'm looking at it through that lens. Clearly, there are parts of the world where our "illegal" is their "normal", which further complicates how I run a service accessible from every corner of the world.) I had another example recently that went well beyond moral contention and deep into the realm of illegality:
New sensitive breach: "AI girlfriend" site Muah[.]ai had 1.9M email addresses breached last month. Data included AI prompts describing desired images, many sexual in nature and many describing child exploitation. 24% were already in @haveibeenpwned. More: https://t.co/NTXeQZFr2x
Of all the different things people can disagree on when it comes to our moral compasses, paedophilia is where we unanimously draw the line. But I still flagged it as sensitive because of the reasons outlined above. Many people using the service were just lonely guys trying to create an AI girlfriend with no prompts around age. There would be email addresses in there that weren’t entered by the rightful owner. And then, there are cases like this:
That's a firstname.lastname Gmail address. Drop it into Outlook and it automatically matches the owner. It has his name, his job title, the company he works for and his professional photo, all matched to that AI prompt. pic.twitter.com/wpXQMBLf3B
I sat there with my wife, looking at the LinkedIn profile that used the same email address as the person who posted that comment. We looked at his photo and at the veneer of professionalism that surrounded him on that site, knowing what he had written in that prompt above. It was repulsive. Further, beyond being solely an affront to our morals, it was clearly illegal. So, I had many conversations with law enforcement agencies around the world and ensured they had access to the data. Involving law enforcement where data sets contain illegal activity is absolutely the right approach here, but equally, not being the vehicle for implying someone’s affiliation or beliefs and doxing them publicly without due process is also absolutely the right approach.
I understand the gut reaction that flagging a breach like WhiteDate as sensitive protects people whom most of us do not like. But a dozen years of running this service have caused me to consider individual privacy and rights literally hundreds of times, and these conclusions aren’t arrived at hastily. Imagine, for a moment, the possible ramifications for HIBP if the service were used to publicly shame someone as a "Nazi" and that, in turn, had serious real-world consequences for them. Whether that implication was right or not, there are potentially serious ramifications for us that could well leave us unable to operate at all. And, as the Ashley Madison examples show, there are also potentially life-threatening outcomes for individuals.
I don't particularly care about one random, anonymous X account making poorly thought-out statements, but the same sentiment has been expressed after loading previous similar breaches, and it deserves a blog post. Equally, I've written before about why all the other data breaches are publicly searchable and again, that conclusion is not arrived at lightly.
No one shall be subjected to arbitrary interference with his privacy, family, home or correspondence, nor to attacks upon his honour and reputation. Everyone has the right to the protection of the law against such interference or attacks.
Breaches with legally defined sensitive data will continue to be flagged as sensitive, and breaches with illegal data will continue to be forwarded to law enforcement agencies.
The sheer scope of cybercrime can be hard to fathom, even when you live and breathe it every day. It's not just the volume of data, but also the extent to which it replicates across criminal actors seeking to abuse it for their own gain, and to our detriment.
We were reminded of this recently when the FBI reached out and asked if they could send us 630 million more passwords. For the last four years, they've been sending over passwords found during the course of their investigations in the hope that we can help organisations block them from future use. Back then, we were supporting 1.26 billion searches of the service each month. Now, it's... more:
Just as it's hard to wrap your head around the scale of cybercrime, I find it hard to grasp that number fully. On average, that service is hit nearly 7 thousand times per second, and at peak, it's many times more than that. Every one of those requests is a chance to stop an account takeover. But the real scale goes well beyond the API itself. Because the data model is open source and freely available, many organisations use the Pwned Passwords Downloader to take the entire corpus offline and query it directly within their own applications. That tool alone calls the API around a million times during download, but the resulting data is then queried… well, who knows how many times after that. Pretty cool, right?
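As a side note, that "around a million times" figure falls directly out of the API's design: responses are keyed on the first five hex characters of a SHA-1 hash, so a full download has to walk every possible prefix. A quick sanity check in Python:

```python
# The Pwned Passwords range API partitions the corpus by the first
# five hex characters of each password's SHA-1 hash, so a complete
# offline download must request every possible prefix.
HEX_CHARS = 16
PREFIX_LENGTH = 5

total_ranges = HEX_CHARS ** PREFIX_LENGTH
print(total_ranges)  # 1048576 - i.e. "around a million" API calls
```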
This latest corpus of data came to us as a result of the FBI seizing multiple devices belonging to a suspect. The data appeared to have originated from both the open web and Tor-based marketplaces, Telegram channels and infostealer malware families. We hadn't seen about 7.4% of them in HIBP before, which might sound small, but that's 46 million vulnerable passwords we weren't giving people using the service the opportunity to block. So, we've added those and bumped the prevalence count on the other 584 million we already had.
We're thrilled to be able to provide this service to the community for free and want to also quickly thank Cloudflare for their support in providing us with the infrastructure to make this possible. Thanks to their edge caching tech, all those passwords are queryable from a location just a handful of milliseconds away from wherever you are on the globe.
If you're hitting the API, then all the data is already searchable for you. If you're downloading it all offline, go and grab the latest data now. Either way, go forth and put it to good use and help make a cybercriminal's day just that much harder 😊
Normally, when someone sends feedback like this, I ignore it, but it happens often enough that it deserves an explainer, because the answer is really, really simple. So simple, in fact, that it should be evident to the likes of Bruce, who decided his misunderstanding deserved a 1-star Trustpilot review yesterday:
You think you know - and Bruce thinks he knows - but you might both be wrong. To explain the answer to the question, we need to start with how HIBP ingests data, and that really is pretty simple: someone sends us a breach (which is typically just text files of data), and we run the open source Email Address Extractor tool over it, which then dumps all the unique addresses into a file. That file is then uploaded into the system, where the addresses are then searchable.
The logic for how we extract addresses is all in that GitHub repository, but in simple terms, it boils down to this:
There must be an @ symbol
There can be up to 64 characters before it (the alias)
There can be up to 255 characters after it (the domain)
A few other little criteria that are all documented in the public repo
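Those rules can be sketched in a few lines of Python (a simplified approximation for illustration only; the authoritative logic, including the extra criteria, lives in the public repo):

```python
# Simplified approximation of the structural checks described above.
# This is NOT the real Email Address Extractor logic, just a sketch
# mirroring the rules listed in this post.
def looks_like_email(candidate):
    if candidate.count("@") != 1:
        return False  # there must be an @ symbol
    alias, domain = candidate.split("@")
    if not alias or len(alias) > 64:      # up to 64 characters before it
        return False
    if not domain or len(domain) > 255:   # up to 255 characters after it
        return False
    return True

print(looks_like_email("foo@bar.com"))       # True
print(looks_like_email("no-at-symbol.com"))  # False
```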
That is all! We can't then tell if there's an actual mailbox behind the address, as that would require massive per-address processing, for example, sending an email to each one and seeing if it bounces. Can you imagine doing that 7 billion times?! That's the number of unique addresses in HIBP, and clearly, it's impossible. So, that means all the following were parsed as being valid and loaded into HIBP (deep links to the search result):
I particularly like that last one, as it feels like a sentiment Bruce would express. It's also a great example as it's clearly not "real"; the alias is a bit of a giveaway, as is the domain ("foo" is commonly used as a placeholder, similar to how we might also use "bar", or combine them as "foo bar"). But if you follow the link and see the breach it was exposed in, you'll see a very familiar name:
Which brings us to the next question:
How Do "Fake" Email Addresses End up in Real Websites?
This is also going to seem profoundly simple when you see it. Here goes:
Any questions, Bruce? This is just as easily explainable as why we considered it a valid address and ingested it into HIBP: the email address has a valid structure. That is all. That's how it got into Adobe, and that's how it then flowed through into HIBP.
Ah, but shouldn't Adobe verify the address? I mean, shouldn't they send an email to the address along the lines of "Hey, are you sure you want to sign up for this service?" Yes, they should, but here's the kicker: that doesn't stop the email address from being added to their database in the first place! The way this normally works (and this is what we do with HIBP when you sign up for the free notification service) is you enter the email address, the system generates a random token, and then the two are saved together in the database. A link with the token is then emailed to the address and used to verify the user if they then follow that link. And if they don't follow that link? We delete the email address if it hasn't been verified within a few days, but evidently, Adobe doesn't. Most services don't, so here we are.
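As a rough sketch of that double-opt-in flow (all names and the in-memory "database" here are illustrative, not HIBP's actual code):

```python
import secrets
import time

# Sketch of the verification flow described above: the address and a
# random token are stored together, a link containing the token is
# emailed, and addresses left unverified for a few days are rejected
# (a real system would also purge them from the database).
VERIFY_WINDOW_SECONDS = 3 * 24 * 60 * 60  # "a few days"
pending = {}  # token -> (email, created_at)

def begin_signup(email):
    token = secrets.token_urlsafe(32)
    pending[token] = (email, time.time())
    return token  # would be embedded in the emailed verification link

def verify(token):
    record = pending.pop(token, None)
    if record is None:
        return None  # unknown or already-used token
    email, created_at = record
    if time.time() - created_at > VERIFY_WINDOW_SECONDS:
        return None  # expired; the address never becomes active
    return email
```

The key point is what happens when the link is never followed: the address expires and never becomes active, whereas services like Adobe evidently keep the unverified address in their database indefinitely.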
How Can I Be Really Sure Actual Fake Addresses Aren't in HIBP?
This is also going to seem profoundly obvious, but genuinely random email addresses (not "thisisfuckinguseless@") won't show up in HIBP. Want to test the theory? Try 1Password's generator (yes, Bruce, they also sponsor HIBP):
Huh, would you look at that? And you can keep doing that over and over again. You’ll get the same result because they are fabricated addresses that no one else has created or entered into a website that was subsequently breached, ipso facto proving they cannot appear in the dataset.
Conclusion
Today is HIBP's 12th birthday, and I've taken particular issue with Bruce's review because it calls into question the integrity with which I run this service. This is now the 218th blog post I've written about HIBP, and over the last dozen years, I've detailed everything from the architecture to the ethical considerations to how I verify breaches. It's hard to imagine being any more transparent about how this service runs, and per the above, it's very simple to disprove the Bruces of the world. If you've read this far and have an accurate, fact-based review you'd like to leave, that'd be awesome 😊
I hate hyperbolic news headlines about data breaches, but for the "2 Billion Email Addresses" headline to be hyperbolic, it'd need to be exaggerated or overstated - and it isn't. It's rounded up from the more precise number of 1,957,476,021 unique email addresses, but other than that, it's exactly what it sounds like. Oh - and 1.3 billion unique passwords, 625 million of which we'd never seen before either. It's the most extensive corpus of data we've ever processed, by a significant margin.
Edit: Just to be crystal clear about the origin of the data and the role of Synthient (who you’ll read about in the next paragraph): this data came from numerous locations where cybercriminals had published it. Synthient (run by Ben during his final year of college) indexed that data and provided it to Have I Been Pwned solely for the purpose of notifying victims. He’s the good guy shining a light on the bad guys, so keep that in mind as you read on. (Some of the feedback Ben has received is exactly what I foreshadowed in the final paragraph of this post.)
A couple of weeks ago, I wrote about the 183M unique email addresses that Synthient had indexed in their threat intelligence platform and then shared with us. I explained that this was only part of the corpus of data they'd indexed, and that it didn't include the credential stuffing records. Stealer log data is obtained by malware running on infected machines. In contrast, credential stuffing lists usually originate from other data breaches where email addresses and passwords are exposed. They're then bundled up, sold, redistributed, and ultimately used to log in to victims' accounts. Not just the accounts they were initially breached from, either: because people reuse the same password over and over again, the data from one breach is frequently usable on completely unrelated sites. A breach of a forum to comment on cats often exposes data that can then be used to log in to the victim's shopping, social media and even email accounts. In that regard, credential stuffing data becomes "the keys to the castle".
Let me run through how we verified the data, what you can do about it and for the tech folks, some of the hoops we had to jump through to make processing this volume of data possible.
Data Verification
The first person whose data I verified was easy - me 😔 An old email address I've had since the 90s has been in credential stuffing lists before, so it wasn't too much of a surprise. Furthermore, I found a password associated with my address, which I'd definitely used many eons ago, and it was about as terrible as you'd expect from that era. However, none of the other passwords associated with my address were familiar. They certainly looked like passwords that other people might have feasibly used, but I'm pretty sure they weren't mine. One was even just an IP address from Perth on the other side of the country, which is infeasible as a password I would have used, yet eerily close to home. I mean, of all the places in the world an IP address could have appeared from, it had to be somewhere in my own country I've been many times before...
Moving on to HIBP subscribers, I reached out to a handful and asked for support verifying the data. I chose a mix of subscribers, many of whom had never been involved in any data breach we'd ever seen before; my experience above suggested that there's recycled data in there, and we had previously verified that when investigating those other incidents. However, is the all-new stuff legitimate? The very first response I received was exactly what I was looking for:
#1 is an old password that I don't use anymore. #2 is a more recent password. Thanks for the heads up, I've gone and changed the password for every critical account that used either one.
Perfectly illustrating most people's behaviour with passwords, #2 referred to above was just #1 with two exclamation marks at the end!! (Incidentally, these were simple six and eight-character passwords, and neither of them was in Pwned Passwords either.) He had three passwords in total, which also means one of them, like with my data, was not familiar. However, the most important thing here is that this example perfectly illustrates why we put the effort into processing data like this: #2 was a real, live password that this guy was actively using, and it was sitting right next to his email address, being passed around among criminals. However, through this effort, that credential pair has now become useless, which is precisely what we're aiming for with this exercise, just a couple of billion times over.
The second respondent only had one password against their address:
Yes that was a password I used for many years for what I would call throw away or unimportant accounts between 20 and 10 years ago
That was also only eight characters, but this time, we'd seen it in Pwned Passwords many times before. And the observation about the password's age was consistent with my own records, so there's definitely some pretty old data in there.
The following response was not at all surprising:
I am familiar with that password... I used it almost 10 years ago... and cannot recall the last time I used it.
That was on a corporate account, too, and the owner of the address duly forwarded my email to the cybersecurity team for further investigation. The single password associated with this lady's email address had a massive nine characters, and also hadn't previously appeared in Pwned Passwords.
Next up was a respondent who replied inline to my questions, so I'll list them below with the corresponding answers:
Is this familiar? Yes
Have you ever used it in the past? Yes and is still on some accounts I do not use any longer.
And if so, how long ago? Unfortunately, it is still on some active accounts that I have just made a list of to change or close immediately.
This individual's eight-character password with uppercase, lowercase, numbers and a "special" character also wasn't in Pwned Passwords. Similarly, as with the earlier response, that password was still in active use, posing a real risk to the owner. It would pass most password complexity criteria and slip through any service using Pwned Passwords to block bad ones, so again, this highlights why it was so important for us to process the data.
The next person had three different passwords against rows with their email address, and they came back with a now common response:
Yes, these are familiar, last used 10 years ago
We'd actually seen all three of them in Pwned Passwords before, many times each. Another respondent with precisely the kind of gamer-like passwords you'd expect a kid to use (one of which we hadn't seen before), also confirmed (I think?) their use:
maybe when i was a kid lol
Responses that weren't an emphatic "yes, that's my data" were scarce. The two passwords against one person's name were both in Pwned Passwords (albeit only once each), yet it's entirely possible that neither of them had been used by this specific individual before. It's also possible they'd forgotten a password they'd used more than a decade ago, or it may have even been automatically assigned to them by the service that was subsequently breached. Put it down as a statistical anomaly, but I thought it was worth mentioning to highlight that being in this data set isn't a guarantee of a genuine password of yours being exposed. If your email address is found in this corpus then that's real, of course, so there must be some truth in the data, but it's a reminder that when data is aggregated from so many different sources over such a long period of time, there's going to be some inconsistencies.
Searching Pwned Passwords
As a brief recap, we load passwords into the service we call Pwned Passwords. When we do so, there is absolutely no association between the password and the email address it appeared next to. This is for both your protection and ours; can you imagine if HIBP was pwned? It's not beyond the realm of possibility, and the impact of exposing billions of credential pairs that can immediately unlock an untold number of accounts would be catastrophic. It's highly risky, and completely unnecessary when you can search for standalone passwords anyway without creating the risk of it being linked back to someone.
Think about it: if you have a password of "Fido123!" and you find it's been previously exposed (which it has), it doesn't matter if it was exposed against your email address or someone else's; it's still a bad password because it's named after your dog followed by a very predictable pattern. If you have a genuinely strong password and it's in Pwned Passwords, then you can walk away with some confidence that it really was yours. Either way, you shouldn't ever use that password again anywhere, and Pwned Passwords has done its job.
Checking the service is easy and anonymous, and depending on your level of technical comfort, it can be done in several different ways. Here's a copy and paste from the last Synthient blog post:
Use the Pwned Passwords search page. Passwords are protected with an anonymity model, so we never see them (it's processed in the browser itself), but if you're wary, just check old ones you may suspect.
Use the k-anonymity API. This is what drives the page in the previous point, and if you're handy with writing code, this is an easy approach and gives you complete confidence in the anonymity aspect.
Use 1Password's Watchtower. The password manager has a built-in checker that uses the abovementioned API and can check all the passwords in your vault. (Disclosure: 1Password is a regular sponsor of this blog, and has product placement on HIBP.)
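For the second option, the whole k-anonymity trick fits in a few lines; this sketch shows the client-side half (the actual lookup is an HTTP GET to https://api.pwnedpasswords.com/range/{prefix}):

```python
import hashlib

# k-anonymity model: hash the password locally, send only the first
# five hex characters of the SHA-1 to the API, then compare the
# returned suffixes on your own machine. The server never sees enough
# to reconstruct the password.
def hash_prefix_and_suffix(password):
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]

def count_in_range(suffix, range_body):
    # range_body is the API response: one "SUFFIX:COUNT" pair per line
    for line in range_body.splitlines():
        candidate, _, count = line.partition(":")
        if candidate == suffix:
            return int(count)
    return 0

prefix, suffix = hash_prefix_and_suffix("password")
print(prefix)  # 5BAA6 - only this ever leaves your machine
```

Matching the returned suffixes happens entirely locally, which is what gives the model its anonymity guarantee.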
My vested interest in 1Password aside, Watchtower is the easiest, fastest way to understand your potential exposure in this incident. And in case you're wondering why I have so many vulnerable and reused passwords, it's a combination of the test accounts I've saved over the years and the 4-digit PINs some services force you to use. Would you believe that every single 4-digit number ever has been pwned?! (If you're interested, the ABC has a fantastic infographic using a heatmap based on HIBP data that shows some very predictable patterns for 4-digit PINs.)
This Is Not a Gmail Breach
It pains me to say it, but I have to, given the way the stealer logs made ridiculous, completely false headlines a couple of weeks ago:
This story has suddenly gained *way* more traction in recent hours, and something I thought was obvious needs clarifying: this *is not* a Gmail leak, it simply has the credentials of victims infected with malware, and Gmail is the dominant email provider: https://t.co/S75hF4T1es
There are 32 million different email domains in this latest corpus, of which gmail.com is one. It is, of course, the largest and has 394 million unique email addresses on it. In other words, 80% of the data in this corpus has absolutely nothing to do with Gmail, and the 20% of Gmail addresses have absolutely nothing to do with any sort of security vulnerability on Google's behalf. There - now let reporting sanity prevail!
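Running the numbers quoted above (figures from the post itself) confirms the split:

```python
# Sanity-checking the proportions: 394M Gmail addresses out of the
# full corpus of 1,957,476,021 unique addresses.
gmail_addresses = 394_000_000
total_addresses = 1_957_476_021

gmail_share = gmail_addresses / total_addresses
print(f"{gmail_share:.0%}")  # 20% Gmail, i.e. 80% everything else
```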
The Technical Bits
I wanted to add this just to highlight how painful it has been to deal with this data. This corpus is nearly 3 times the size of the previous largest breach we'd loaded, and HIBP is many times larger than it was in 2019 when we loaded the Collection #1 data. Taking 2 billion records and adding the ones we hadn't already seen in the existing 15 billion corpus, whilst not adversely impacting the live system serving millions of visitors a day, was very non-trivial. Managing the nuances of SQL Server indexes such that we could optimise both inserts and queries is not my idea of fun, and it's been a pretty hard couple of weeks if I'm honest. It's also been a very expensive period as we turned the cloud up to 11 (we run on Azure SQL Hyperscale, which we maxed out at 80 cores for almost two weeks).
A simple example of the challenge is that after loading all the email addresses up into a staging table, we needed to create SHA1 hashes of each. Normally, that would involve something to the effect of "update table set column = sha1(email)" and you're done. That crashed completely, so we ended up doing "insert into new table select email, sha1(email)". But on other occasions the breach load required us to do updates on other columns (with no hash creation), which, on multiple occasions, we had to kill after a day or more of execution with no end in sight. So, we ended up batching in loops (usually 1M records at a time), reporting on progress along the way so we had some idea of when it would actually finish. It was a painful process of trial, waiting ages, error, then taking a completely different approach.
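The batching pattern itself is straightforward; here's a minimal sketch with sqlite3 standing in for SQL Server (table, column names and batch size are all illustrative - in production the batches were on the order of 1M rows):

```python
import sqlite3

# Work through a big table in fixed-size batches instead of one
# monster UPDATE, committing and reporting progress as we go.
BATCH_SIZE = 3  # illustrative; production batches were ~1M rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging "
             "(id INTEGER PRIMARY KEY, email TEXT, flagged INTEGER DEFAULT 0)")
conn.executemany("INSERT INTO staging (email) VALUES (?)",
                 [(f"user{i}@example.com",) for i in range(10)])

while True:
    # Grab the next slice of unprocessed rows
    rows = conn.execute(
        "SELECT id FROM staging WHERE flagged = 0 ORDER BY id LIMIT ?",
        (BATCH_SIZE,)).fetchall()
    if not rows:
        break
    ids = [r[0] for r in rows]
    conn.executemany("UPDATE staging SET flagged = 1 WHERE id = ?",
                     [(i,) for i in ids])
    conn.commit()  # keep transactions small
    print(f"processed through id {ids[-1]}")
```

Each batch commits independently, so progress is visible and a killed run can resume where it left off rather than rolling back a day of work.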
Notifying our subscribers is another problem. We have 5.9 million of them, and 2.9 million are in this data 🫨 Simply sending that many emails at once is hard. It's not so much hard in terms of firing them off, rather it's hard in terms of not ending up on a reputation naughty list or having mail throttled by the receiving server. That's happened many times in the past when loading other large (albeit much smaller) corpuses; Gmail, for example, suddenly sees a massive spike and slows down the delivery to inboxes. Not such a biggy for sending breach notices, but a major problem for people trying to sign into their dashboard who can no longer receive the email with the "magic" link.
What we've done to address that for this incident is to slow down the delivery of emails for the individual breach notification. Whilst I'd originally intended to send the emails at a constant rate over the period of a week, someone listening to me on my Friday live stream had a much better suggestion:
the strategy I've found to best work with large email delivery is to look at the average number of emails you've sent over the last 30 days each time you want to ramp up, and then increase that volume by around 50% per day until you've worked your way through the queue
Which makes a lot of sense, and stacked up as I did more research (thanks Joe!). So, here's what our planned delivery schedule now looks like:
That's broken down by hour, increasing in volume by 1.015 times per hour, such that the emails are spread out in a similar, gradually increasing cadence. On a daily basis, that works out at roughly a 43% increase in each 24-hour period, within Joe's suggested 50% threshold. Plus, we obviously have all the other mechanisms such as a dedicated IP, properly configured DKIM, DMARC and SPF, only emailing double-opted-in subscribers and spam-friendly message body construction. So, it could be days before you receive a notification, or just run a haveibeenpwned.com search on demand if you're impatient.
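For the curious, the compounding maths of that ramp is easy to verify (the starting volume below is purely illustrative):

```python
# Each hour's send volume is 1.015x the previous hour's. Compounded
# over 24 hours, that gives a daily growth factor of about 1.43.
HOURLY_GROWTH = 1.015

daily_growth = HOURLY_GROWTH ** 24
print(f"daily growth factor: {daily_growth:.3f}")  # ~1.43

# First few hours of a schedule starting at a hypothetical
# 10,000 emails/hour:
volume = 10_000.0
for hour in range(4):
    print(f"hour {hour}: {volume:,.0f} emails")
    volume *= HOURLY_GROWTH
```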
We've sent all the domain notification emails instantly because, by definition, they're going to a very wide range of different mail servers; it's just the individual ones we're drop-feeding.
Lastly, if you've integrated Pwned Passwords into your service, you'll now see noticeably larger response sizes. The numbers I mentioned in the opening paragraph increase the size of each hash range by an average of about 50%, which will push responses from about 26kb to 40kb. That's when brotli compressed, so obviously, make sure you're making requests that make the most of the compression.
Conclusion
This data is now searchable in HIBP as the Synthient Credential Stuffing Threat Data. It's an entirely separate corpus from that previous Synthient data I mentioned earlier; they're discrete datasets with some crossover, but obviously, this one is significantly larger. And, of course, all the passwords are now searchable per the Pwned Passwords guidance above.
If I could close with one request: this was an extremely laborious, time-consuming and expensive exercise for us to complete. We've done our best to verify the integrity of the data and make it searchable in a practical way while remaining as privacy-centric as possible. Sending as many notifications as we have will inevitably lead to a barrage of responses from people wanting access to complete rows of data, grilling us on precisely where it was obtained from or, believe it or not, outright abusing us. Not doing those things would be awesome, and I suggest instead putting the energy into getting a password manager, making passwords strong and unique (or even better, using passkeys where available), and turning on multi-factor auth. That would be an awesome outcome for all 😊
Edit: I've closed off comments on this blog post. As you'll see below, there was a constant stream of questions that have already been answered in the post itself, plus some comments that were starting to verge on precisely what I predicted in the last para above. Reading, responding and engaging is time-consuming and at this point, all the answers are already here both above and below this edit in the comments.
Where is your data on the internet? I mean, outside the places you've consciously provided it, where has it now flowed to and is being used and abused in ways you've never expected? The truth is that once the bad guys have your data, it often replicates over and over again via numerous channels and platforms. If you're able to aggregate enough of it en masse, you end up with huge volumes of "threat intelligence data", to use the industry buzzword. And that's precisely what Ben from Synthient has done, and then sent it to Have I Been Pwned (HIBP).
Ben is in his final year of college in the US and is carving out a niche in threat intelligence. He's written up a deeper dive in The Stealer Log Ecosystem: Processing Millions of Credentials a Day, but the headline gives you a sense of the volumes. Have a read of that post and you'll see Ben is pulling data from various sources, including social media, forums, Tor and, of course, Telegram. He's managed to aggregate so much of it that by the time he sent it to us, it was rather sizeable:
That's 3.5 terabytes of data, with the largest file alone being 2.6TB and, combined, they contain 23 billion rows. It's a vast corpus, and if we were attempting to compete with recent hyperbolic headlines about breach sizes, this would be one of the largest. But I'm not going to play the "mine is bigger than yours" game because it makes no sense once you start analysing the data. Part of what makes the data so large is that we're actually looking at both stealer logs and credential stuffing lists, so let's assess them separately, starting with those stealer logs.
Stealer Logs
Stealer logs are the product of infostealers, that is, malware running on infected machines and capturing credentials as they're entered into websites. The output of those stealer logs primarily comprises three things:
Website address
Email address
Password
Someone logging into Gmail, for example, ends up with their email address and password captured against gmail.com, hence the three parts. Due to the fact that stealer logs are so heavily recycled (they're posted over and over again to the sorts of channels Ben monitors), the first thing we always do is try to get a sense of how much is genuinely new:
This is the output of a little PowerShell script we use to gauge where the email addresses in a new breach corpus have been seen before. Especially when there's a suspicion that data might have been repurposed from elsewhere, it's really useful to run them against the HIBP API and see what comes back. What the output above tells us is that after checking a sample of 94k of them, 92% had been previously seen, mostly in stealer log corpuses we'd loaded in the past. This is an empirical demonstration of what I wrote in the opening paragraph - "it often replicates over and over again" - and as you can see, most of what has been seen before was in the ALIEN TXTBASE stealer logs.
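The same check translates readily to other languages; in Python, it's just a tally of which addresses come back with breach hits. This is my own sketch, with the HIBP lookup injected as a function so the example isn't tied to a live API key:

```python
from collections import Counter

def prior_exposure(addresses, lookup):
    """Fraction of addresses previously seen, plus a tally of which breaches they were in.

    lookup(addr) must return a list of breach names (empty if never seen),
    e.g. a wrapper around HIBP's breachedaccount API."""
    breaches, seen = Counter(), 0
    for addr in addresses:
        names = lookup(addr)
        if names:
            seen += 1
            breaches.update(names)
    return seen / len(addresses), breaches
```

Run over a sample of a new corpus, the fraction tells you how much is recycled, and the Counter tells you where it was recycled from.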
Back to the console output again, and having previously seen 92% of addresses also means we haven't seen 8% of the addresses. That's 8% of a considerable number, too: we found 183M unique email addresses across Ben's stealer log data, so we're talking about 14M+ addresses that have never surfaced in HIBP. (The final number once the entire data set was loaded into HIBP was 91% pre-existing, with 16.4M previously unseen addresses in any data breach, not just stealer logs.) But as with everything we load, the question has to be asked: Is it legit? Can you trust the shady criminals who publish this data not to fill it with junk? The only way to know for sure is to ask the legitimate owners of the data, so I reached out to a bunch of our subscribers and sought their support in verifying.
One of the respondents was already concerned there could be something wrong with his Gmail account, and sure enough, he had one stealer log entry for "https://accounts.google.com/signin/challenge/pwd/1" with a, uh, "suboptimal" password:
Yes I can confirm that was an accurate password on my gmail account a few months ago
Another respondent who offered support had somewhat of a recognisable pattern in the sites he'd been visiting:
To his credit, he responded and confirmed that the list did indeed contain sites he'd visited, which also included online casinos, crypto websites and VPN services:
They all look like websites I have used and some still do use
As it turns out, he also had two other email addresses in the corpus of data, both with the same collection of passwords used on the first address he replied from. They both also aligned to services on the same TLD as the first address, suggesting which country he's located in. (Incidentally, the online privacy offered by VPNs kinda falls apart when there's malware on your machine watching every site you visit and recording your credentials.)
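That sort of TLD alignment is trivial to quantify; a minimal sketch (the function name is mine):

```python
from collections import Counter
from urllib.parse import urlparse

def tld_profile(site_urls):
    """Tally the last domain label of each site - a rough geographic signal
    (e.g. a pile of '.gr' entries suggests a user in Greece)."""
    tally = Counter()
    for u in site_urls:
        host = urlparse(u).hostname or ""
        if "." in host:
            tally[host.rsplit(".", 1)[-1]] += 1
    return tally
```

It's crude (country-code TLDs are only a hint, and plenty of sites sit on .com), but across hundreds of rows for one address, a clear national pattern tends to emerge.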
Even without a response from a subscriber, it's still easy to get a sense of the legitimacy of the data in a privacy-preserving fashion (i.e. not logging in with their credentials!) just by testing enumeration vectors. For example, one subscriber had an account at ShopBack in the Philippines which offers what I'll refer to as "account enumeration as a service":
I simply added some characters in front of the email address and ShopBack happily confirmed that address didn't exist. However, remove the invalid characters and there's a very different response:
All of these little "tells" add up; another subscriber had a high prevalence of Greek websites they used, showing exactly the sort of pattern you'd expect to see for someone from that corner of the world. Another had various online survey sites they'd used, and like our "assandfurious" friend from earlier, a clear pattern emerged consistent with the apparent interests of the address's owner. Time and time again, the data checked out, so we loaded it. Those 183M email addresses are now searchable in HIBP, and the passwords are also searchable in Pwned Passwords, which has become rather popular:
Pwned Passwords just served 17.45 billion requests in 30 days 🤯 That's an *average* of 6,733 requests per second, but at our peak, we're hitting 42k per second in a 1-minute block. Crazy numbers! Made possible by @Cloudflare 😎 pic.twitter.com/Io6u1PiqJf
The website addresses are also now searchable, either in the stealer log section of your personal dashboard or by verified domain owners using the API. You'll find this data named "Synthient Stealer Log Threat Data" in HIBP, but stealer logs are only part of the Synthient story - the small part!
Credential Stuffing
Ben's data also contained credential stuffing lists. Unlike stealer logs, which are the product of malware on the victim's machine, credential stuffing lists are typically aggregated from other places where email address and password pairs are obtained. For example, from data breaches where the passwords are either stored in plain text or protected with easily crackable hashing algorithms. Those lists are then used to access the other accounts of victims where they've reused their passwords.
Quick sidenote: Credential stuffing lists can be enormously damaging because they contain the keys to so many different services. Not only are they the gateway to so many takeovers of social media accounts, email addresses and other valuable personal resources, they're also responsible for many subsequent very serious data breaches. The 2017 Uber breach was attributed to previously breached employee credentials. Five years later, and the same approach provided the initial access to Uber again, after which MFA-bombing sealed the deal. Then there was the 23andMe breach in 2023, which was also traced back to credential stuffing. Similar but different was when Dunkin' Donuts had 20k customer details exposed in a show of how multifaceted this style of attack is: they were subsequently sued for not having sufficient controls to stop hackers from simply logging in with victims' legitimate credentials. It's wild; it's the attack that just keeps on giving.
Ever since loading Collection #1 in 2019, I have been extra cautious about dealing with credential stuffing lists. The 400+ comments on that blog post will give you just a little taste of how much attention that exercise garnered. Frankly, it was a significant contributor to the feeling that it was all getting a bit too much, leading to the decision that HIBP needed to find another home (which fortunately, never eventuated). The primary issue with credential stuffing lists is that we can't attribute a given row to a specific source website or data breach, and we don't offer a service to look up credential pairs. As you'll see from many of the comments on that post, I had angry people upset that, without knowing specifically which password was exposed in the list, the knowledge that they were in there was not actionable. I disagree, because by loading those passwords into Pwned Passwords, there are now three easy ways to check if you're using a vulnerable one:
Use the Pwned Passwords search page. Passwords are protected with an anonymity model, so we never see them (it's processed in the browser itself), but if you're wary, just check old ones you may suspect.
Use the k-anonymity API. This is what drives the page in the previous point, and if you're handy with writing code, this is an easy approach and gives you complete confidence in the anonymity aspect.
Use 1Password's Watchtower. The password manager has a built-in checker that uses the abovementioned API and can check all the passwords in your vault. (Disclosure: 1Password is a regular sponsor of this blog, and has product placement on HIBP.)
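The k-anonymity model behind the second option is straightforward to sketch: hash the password locally, send only the first five hex characters of the SHA-1 to the range endpoint, and compare suffixes on your own machine. The fetcher is injected here so the example runs offline:

```python
import hashlib

def pwned_count(password, fetch_range):
    """Return how many times a password appears in Pwned Passwords.

    Only digest[:5] ever leaves the machine; fetch_range(prefix) must return
    the range API's text body of 'SUFFIX:COUNT' lines."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    prefix, suffix = digest[:5], digest[5:]
    for line in fetch_range(prefix).splitlines():
        candidate, _, count = line.partition(":")
        if candidate == suffix:
            return int(count)
    return 0
```

Wired up for real, `fetch_range` would be something like `lambda p: urllib.request.urlopen(f"https://api.pwnedpasswords.com/range/{p}").read().decode()` - neither the password nor even its full hash ever leaves your machine.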
My vested interest in 1Password aside, Watchtower is the easiest, fastest way to understand your potential exposure in this incident. And in case you're wondering why I have so many vulnerable and reused passwords, it's a combination of the test accounts I've saved over the years and the 4-digit PINs some services force you to use. Would you believe that every single 4-digit number ever has been pwned?! (If you're interested, the ABC has a fantastic infographic using a heatmap based on HIBP data that shows some very predictable patterns for 4-digit PINs.)
As of the time of publishing this blog post, only the stealer logs have been loaded, and as mentioned earlier, the data in HIBP has been called "Synthient Stealer Log Threat Data". We intend to load the credential stuffing data as a separate corpus next week and call it "Synthient Credential Stuffing Threat Data", assuming it's sufficiently new and the accuracy is confirmed with our subscribers! We're doing this in two parts simply because of the scale of the data and the fact that we want to break it into two discrete corpuses given the data originates via different means. I'll revise this blog post accordingly after we finish our analysis.
Future
Something that is becoming more evident as we load more stealer logs is that treating them as a discrete "breach" is not an accurate representation of how these things work. The truth is that, unlike a single data breach such as Ashley Madison, Dropbox, or the many other hundreds already in HIBP, stealer logs are more of a firehose of data that's just constantly spewing personal info all over the place. That, combined with the duplication of previously seen data, means that we need a rethink on this model. The data itself is still on point, but I'd like to see HIBP better reflect that firehose analogy and provide a constant stream of new data. Until then, Synthient's Threat Data will still sit in HIBP and be searchable in all the usual ways.
You see it all the time after a tragedy occurs somewhere, and people flock to offer their sympathies via the "thoughts and prayers" line. Sympathy is great, and we should all express that sentiment appropriately. The criticism, however, is that the line is often offered as a substitute for meaningful action. Responding to an incident with "thoughts and prayers" doesn't actually do anything, which brings us to court injunctions in the wake of a data breach. Let's start with HWL Ebsworth, an Australian law firm:
The final interlocutory injunction restrained hackers from the ALPHV, or “BlackCat”, hackers group from publishing the HWL data on the internet, sharing it with any person, or using the information for any reason other than for obtaining legal advice on the court’s orders.
To paraphrase, the injunction prohibits the Russian crime gang that hacked the law firm and attempted to extort them from publishing the data on the internet. Right... The threat actor was subsequently served with the injunction, to which, per the article, they responded in an entirely predictable fashion:
Fuck you fuckers
And then they dumped a huge trove of data. Clearly, criminals aren't going to pay any attention whatsoever to an injunction, but this legal construct has reach far beyond just the bad guys:
The injunction will also “assist in limiting the dissemination of the exfiltrated material by enabling HWLE to inform online platforms, who are at risk of publishing the material”, Justice Slattery said.
In other words, the data is also off limits to the good guys. Journalists, security firms and yes, Have I Been Pwned (HIBP) are all impacted by injunctions like this. To some extent, you can understand this when the data is as sensitive as what a law firm typically holds, and you need only use a little bit of imagination to picture how damaging it can be for data like this to fall into the wrong hands. But data in a breach of a company like Qantas is very different:
And now here’s mine. Still no indication of specifically which service was breached, but feels very much like loyalty program data (i.e. nothing to do with specific flights, password, passport or payment details). pic.twitter.com/r7KnlfM8TV
As well as my interest in running HIBP, I also appear to be a victim of their data breach, along with my wife and kids. And just to highlight how much skin I have in the game, I'm also a Qantas shareholder and a very loyal customer:
Sitting at the airport about to take my 301st (tracked) @Qantas flight. Nice banter with the staff: “you can lose my data, just don’t lose my bags” 😬 pic.twitter.com/ZGxc4I0aB1
As such, I was particularly interested when they applied for, and were granted, a court injunction of their own. Why? What possible upside does this provide? Because by now, it's pretty clear what's going to happen to the data:
This is from a Telegram channel run by the group that took the Qantas data, along with some other huge names:
🚨🚨🚨BREAKING - New data leak site by Scattered LAPSUS$ Hunters exposes Salesforce customers. Dozens of global companies involved in a large-scale extortion campaign.
Scattered LAPSUS$ Hunters claims to have breached Salesforce, exfiltrating ~1B records. They accuse Salesforce… pic.twitter.com/u2PAO7miyP
"Scattered LAPSUS$ Hunters" is threatening to dump all the data publicly in a couple of days' time unless a ransom is paid, which it won't be. The quote from the Telegram image is from a Qantas spokesperson, and clearly, the injunction is not going to stop the publishing of data. Much of my gripe with injunctions is the premise that they in some way protect customers (like me), when clearly, they don't. But hey, "thoughts and prayers", right?
Without wanting to give too much credit to criminals attempting to ransom my data (and everyone else's), they're right about the media outlets. An injunction would have had a meaningful impact on the Ashley Madison coverage a decade ago, where the press happily outed the presence of famous people in the breach. Clearly, the Qantas data is nowhere near as newsworthy, and I can't imagine a headline going much beyond the significant point balances of certain politicians. The data just isn't that interesting.
The injunction is only effective against people who meet the following criteria:
People who know there's an injunction in place
People who are law-abiding
People in Australia *
The first two points are obvious, and an asterisk adorns the third as it's very heavily caveated. This from a chat this morning with a lawyer friend who specialises in this space:
it would depend on which country and whether it has a reciprocal agreement with Australia eg like the UK and also who you are trying it enforce it against and then it’s up to the court in that country to determine - but as this is an injunction (so not eg for a debt against a specific person) it’s almost impossible - you can’t just register a foreign judgement somewhere against the world at large as far as I know.
Where that confidentiality is breached due to a hack, parties should generally do - and be seen to be doing - what they can to prevent or minimise the extent of harm. Even if injunctions might not impact hackers, for the reasons set out above, they can provide ancillary benefits in relation to the further dissemination of hacked information by legitimate individuals and organisations. Depending on the terms, it might also assist with recovery on relevant insurance policies and reduce the risk of securities class actions being brought.
That term - "be seen to be doing" - says it all. This is now just me speculating, but I can envisage lawyers for Qantas standing up in court when they're defending against the inevitable class actions they'll face (which I also have strong views on), saying "Your honour, we did everything we could, we even got an injunction!" In a previous conversation I had regarding another data breach that had successfully been granted an injunction, I was told by the lawyer involved that they wanted to assure customers that they'd done everything possible. That breach was subsequently circulated online via a popular clear web hacking site (not "the dark web"), but I assume this fact and the ineffectiveness of the injunction on that audience was left out of customer communications. I feel pretty comfortable arguing that the primary beneficiary of the injunction is the shareholder, rather than the customer. And I assume the lawyers charge for their time, right?
Where this leaves us with Qantas is that, on a personal note, as a law-abiding Australian who is aware of the injunction, I won't be able to view my data or that of my kids. I can always request it of Qantas, of course, but I won't be able to go and obtain it if and when it's spread all over the internet. The criminals will, of course, and that's a very uncomfortable feeling.
From an HIBP perspective, we obviously can't load that data. It's very likely that hundreds of thousands of our subscribers will be impacted, and we won't be able to let them know (which is part of the reason I've written this post - so I can direct them here when asked). Granted, Qantas has obviously sent out disclosure notices to impacted individuals, but I'd argue that the notice that comes from HIBP carries a different gravitas: it's one thing to be told "we've had a security incident", and quite another to learn that your data is now in circulation to the extent that it's been sent to us. Further, Qantas won't be notifying the owners of the domains that their customers' email addresses are on. Many people will be using their work email address for their Qantas account, and when you tie that together with the other exposed data attributes, that creates organisational risk. Companies want to know when corporate assets (including email addresses) are exposed in a data breach, and unfortunately, we won't be able to provide them with that information.
I understand that Qantas' decision to pursue the injunction is about something much broader than the email addresses potentially appearing in HIBP. I actually think much of the advice Qantas has given is good, for example, the resources they've provided on their page about the breach:
These are all fantastic, and each of them has many good external resources people worried about scams should refer to. For example, ScamWatch has this one:
The scam resources Qantas recommends all link through to a service that will never return the Qantas data breach. Did I mention "thoughts and prayers" already?
Update, 14 Oct 2025
As threatened, the Qantas data was dumped publicly 2 days after writing this post. The data appeared on a clear web file sharing service linked to by both their .onion website and a clear web site that popped up on a new domain shortly after the data was publicised. Due to the injunction, I've not accessed the data myself but have had security folks in other parts of the world reach out and confirm my record and that of my family members is present. I've also seen public commentary from other researchers analysing the data, and have had multiple people contact me and offer to send it.
Clearly, the injunction has proven to be extremely limited in its ability to stop the spread of data. Further, as a Qantas customer, I've not heard anything from them in relation to my data having now been publicly released. None of this should surprise anybody, including Qantas.
As to questions in the comments about the legitimacy of the injunction and where it can be obtained, the advice I've obtained from the law firm we use is that it is absolutely legitimate and interested parties would need to contact Qantas if they want to see it (likely a redacted version). I'm not savvy with the mechanics of how courts issue these and why they're not more publicly accessible, but I suggest this story with quotes from Justice Kunc is worth a read. Frankly, it's all a bit nuts, but that's the environment we're operating in and the rules we need to adhere to.
It's hard to explain the significance of CERN. It's the birthplace of the World Wide Web and the home of the largest machine ever built, the Large Hadron Collider. The bit that's hard to explain is, well, I mean, look at it!
Charlotte and I visited CERN in 2019, nestled in there between Switzerland and France, and descended into the mountainside where we saw the world's largest particle accelerator firsthand. I can't explain this! The physics are just mind-bending.
A few months ago, we headed back there and saw even more stuff I can't explain:
How on earth do you make antimatter?! I know there's a lot of magnets involved, but that's about the limit of my understanding.
But what I do understand a little better is the importance of CERN. They're working to help humanity understand the most profound questions about the universe by exploring fundamental physics—the very building blocks of nature. And closer to my heart (or at least to my expertise), their role in the World Wide Web and the contribution CERN has made to the internet as we know it today cannot be overstated. It's also staffed by passionate individuals with a love of science that transcends borders and politics, including many from parts of the world that don't normally see eye-to-eye. This passion was evident on both our visits, and perhaps that's an extra poignant observation in a time with so much conflict.
In relation to HIBP and our ongoing support of governments, CERN is similar yet different. It's an intergovernmental organisation operating outside the jurisdiction of any one nation. However, they face the same online threats, and just like sovereign government states, their people sign up to services that get breached and end up in HIBP. And, like the governments we support, services that can be provided to help them tackle that threat are always appreciated. I was surprised to hear on our last visit that the sum total of contributions from their member states amounts to the price of a cup of coffee per person per year! For the work they do and the contribution they make to society, onboarding CERN as the 41st (inter)government was a no-brainer. They now have full and free access to query all CERN domains across the breadth of HIBP data. Welcome aboard CERN!
One of the most common use cases for HIBP's API is querying by email address, and we support hundreds of millions of searches against this endpoint every month. Loads of organisations use this service to understand the exposure of their customers and provide them with better protection against account takeover attacks. Many also use it to support customers who've already fallen victim - "hey, did you know HIBP says you're in 7 data breaches, any chance you've been reusing passwords?" Some companies even use it to help establish the legitimacy of an email address; we're all so pwned that if an address isn't pwned, maybe it isn't even real.
The latest video demo walks you through how to use this API and introduces something new that has been requested for years: a test API key. We've had this request so many times, and my response has usually been something to the effect of "mate, a key is a few bucks, just get a cheapie and start writing code". However, even if it were just a few cents, it would still pose a burden to some for various reasons. So, today we're also launching a test key:
hibp-api-key: 00000000000000000000000000000000
The test key can only be used for queries against the test accounts (and we've had those for many years now), but it allows developers to start immediately writing code against the real live APIs. The technical implementation is identical to the key you get when you have a paid subscription, so this should help a bunch of people really fast-track their development and remove that one little barrier we previously had. Here's how it all works:
So, that's the breached account API, and it comes off the back of last week's first demo, showing how domain searches work. We've got a heap more to add yet, and I'd love to hear about any others you feel would help you get the most out of the service.
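For anyone wiring up the test key from above, a first request against the v3 breachedaccount endpoint looks something like this. The request is built but not sent, and the user-agent value is arbitrary (though the API does require one):

```python
import urllib.parse
import urllib.request

API_ROOT = "https://haveibeenpwned.com/api/v3"
TEST_KEY = "0" * 32  # the free test key from the post

def breached_account_request(account, api_key=TEST_KEY):
    """Build the GET request for /breachedaccount/{account}.

    The account goes URL-encoded into the path; the key and a user-agent
    travel as headers, exactly as with a paid key."""
    url = f"{API_ROOT}/breachedaccount/{urllib.parse.quote(account)}"
    return urllib.request.Request(url, headers={
        "hibp-api-key": api_key,
        "user-agent": "my-hibp-demo",  # any descriptive value; blank UAs are rejected
    })
```

Sending it is then just `urllib.request.urlopen(req)`; a 200 returns JSON breach records for the test accounts, while a 404 means the account isn't in any breach.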
Well, one of them is, but what's important is that we now have a platform on which we can start pushing out a lot more. It's not that HIBP is a particularly complex system that needs explaining in any depth, but we still get a lot of "how do I..." style questions for the fundamentals. Stuff like "how do I search our domain", which is why that's now the very first video we have in the series:
You'll also find this on the brand new demos page at haveibeenpwned.com/Demos where you'll soon be seeing many more examples that'll start with the basics, then become increasingly complex. The APIs in particular are the source of many support tickets, and we hope that these demos simplify them for the masses and save us some ticketing overhead in the process.
The demo is only five and a bit minutes, and I want to keep each one pretty succinct. If there's something you'd like to see explained, please drop me a comment below, and I'll do my best to create some material on it. In the meantime, check out the brand new HIBP YouTube channel and give it some love, there's a lot more coming.
Incidentally, in checking the stats whilst preparing this, it seems that we now have 357k instances of someone monitoring a domain 😲 That includes almost a quarter of the world's top 1k largest domains too, so this is a very heavily used feature and was a logical place to get started.