Linking 23 million active mortgage records to borrower demographics

The remarkable privacy, mortgage refinance, and MBS prepayment implications

Dec 30, 2024


Mortgagor privacy and the economics of refinance lending (when it eventually returns) are a mere data terms-of-use restriction away from being blown wide open. The prepayment anticipation abilities of Mortgage-Backed Securities (“MBS”) traders/investors are about to be significantly enhanced, if they haven’t been already for those in the know. All of this is made possible by publicly available datasets (albeit ones restricted by terms of use).

Building a robust dataset of the country’s mortgages, Part 1: the MBS data

Over the past couple of months, I have downloaded records for most of the country’s outstanding mortgages, roughly 39 million of them. This data comes directly from the issuers/backers/guarantors of Mortgage-Backed Securities (“MBS”), specifically Ginnie Mae (“GNMA”) for FHA and VA loans, and Fannie Mae (“FNMA”) and Freddie Mac (“FHLMC”) for conventional loans. This loan-level data for loans in outstanding MBS pools includes core characteristics of the loan (think amount, term, interest rate, property state, oftentimes who originated it, etc.) and some additional characteristics provided to help MBS investors evaluate prepayment and default risk, such as loan to value at origination, borrower credit score at origination, and the borrower’s debt to income ratio at origination. This data is updated every month, including whether the loan is current or slipping into delinquency. On a side note, I do this and everything I will describe below not in the cloud or on some expensive and fancy server, but on an HP “business” desktop that could be bought on Black Friday for 800 bucks. Big data just doesn’t seem so big anymore.

Building a robust dataset of the country’s mortgages, Part 2: the HMDA data

I have also downloaded every Home Mortgage Disclosure Act (“HMDA”) record for FHA, VA, and conventional mortgages originated from 2018, when HMDA data collection in its current expanded form started, through 2023. So not all currently outstanding mortgages have a corresponding HMDA record, but most of them do.

Similar to the MBS disclosure data, the HMDA data includes the same core loan characteristics and shares a few of the additional ones like loan to value and debt to income ratio. HMDA data also contains additional information including:

much more specific location (county and census tract of the property instead of just state),
qualifying income for the borrower at origination and
borrower demographic data such as age range, race/ethnicity and gender of the borrower.

Building a robust dataset of the country’s mortgages, Part 3: So happy together

You can link the datasets together, and I have.

Impressive feat, John, but HMDA and MBS disclosure data have coexisted for a couple of years now; surely you aren’t the first to have done this. Well, when I started this project in September, I thought I might be, but since then two commercial providers of MBS disclosure data have announced they are doing this linking as well: one in late September (https://www.recursionco.com/blog/the-big-picture-view-on-mortgage-delinquencies-update) and one earlier this month (https://www.ivolatility.com/news/3040). There may be more, but I don’t believe it is super widespread yet. Why?

The link is not easy to do. While the HMDA data and MBS data share a number of characteristics, GNMA, the GSEs, and HMDA all privacy-mask things like the loan amount (rounding it down to the nearest 1000, for example), but they do it in different ways. Other characteristics that serve as link points, like LTV, don’t always match straight up for the same loan because one dataset includes the financed up-front mortgage insurance or funding fee and the other does not. There is a lot of figuring out that must get done to get a good linking algorithm.
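To make the masking problem concrete, here is a minimal Python sketch of comparing masked values as ranges rather than as exact numbers. It is not my production algorithm. The rounding rules (MBS amount rounded down to a multiple of 1000, HMDA amount reported within 5000 of the true value), the 1.75% financed fee, and every function name are illustrative assumptions.

```python
def mbs_amount_range(masked, step=1000):
    """Possible true amounts if the MBS file rounds down to a multiple of `step` (assumed rule)."""
    return (masked, masked + step - 1)


def hmda_amount_range(masked, half_width=5000):
    """Possible true amounts if HMDA reports a value within `half_width` of the truth (assumed rule)."""
    return (masked - half_width, masked + half_width)


def ranges_overlap(a, b):
    """True when two inclusive (low, high) ranges share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]


def amounts_compatible(mbs_masked, hmda_masked):
    """Could both masked amounts describe the same underlying loan amount?"""
    return ranges_overlap(mbs_amount_range(mbs_masked), hmda_amount_range(hmda_masked))


def ltv_compatible(ltv_a, ltv_b, fee_rate=0.0175, tol=0.1):
    """Compare two LTVs while allowing either one to include a financed up-front fee.

    `fee_rate` and `tol` are hypothetical parameters for illustration.
    """
    candidates = (ltv_b, ltv_b * (1 + fee_rate), ltv_b / (1 + fee_rate))
    return any(abs(ltv_a - c) <= tol for c in candidates)
```

The point of the range comparison is that two masked values never have to be equal, only mutually possible, which is also why so many candidate pairs survive this step and need further narrowing.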
On top of that, there is a ton of normalizing work that has to be done to link the entity that originated the loan to the name of the seller/pool issuer provided in the MBS disclosure. And why is that important? Because there is only so much variation that practically exists in the data points on which you can link. For example, there are a ton of loans that originated in 2021 with a 3% interest rate, most FHA loans originate with a 96.5 LTV, etc. The privacy mask range on some attributes exacerbates this linking problem. An individual MBS loan that matches 2 or more HMDA records is a big problem that must be overcome, and narrowing the search box to the HMDA records filed by a single originator, even one as large as Rocket Mortgage, rather than the whole population of HMDA records for a given year in a given state makes a huge difference in resolving the link for “multi-matchers”. How hard was this? There are literally thousands of entities that originated mortgages, filed HMDA records, and then sold the loans to a GSE or into a GNMA pool. It took a huge amount of painstaking time to create a Rosetta stone mapping legal entities to all the various names and identifiers they have across the HMDA and MBS universes.
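The narrowing step can be sketched roughly as follows. This is a simplified illustration, not my actual linking code: the crosswalk entries, record field names, and the 5000 amount tolerance are all hypothetical, and a real crosswalk has thousands of hand-curated rows.

```python
from collections import defaultdict

# Hypothetical "Rosetta stone": MBS seller/issuer name -> HMDA filer identifier.
CROSSWALK = {
    "EXAMPLE MORTGAGE, LLC": "FILER-001",
    "EXAMPLE MTG LLC": "FILER-001",
    "OTHER LENDER INC.": "FILER-002",
}


def normalize_name(name):
    """Upper-case, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.upper())
    return " ".join(cleaned.split())


NORMALIZED_CROSSWALK = {normalize_name(k): v for k, v in CROSSWALK.items()}


def index_hmda(hmda_records):
    """Group HMDA records into blocks keyed by (filer, state, year)."""
    index = defaultdict(list)
    for rec in hmda_records:
        index[(rec["filer_id"], rec["state"], rec["year"])].append(rec)
    return index


def link_loan(mbs_loan, hmda_index, amount_tol=5000):
    """Return ('unique' | 'multi' | 'none', candidates) for one MBS loan."""
    filer_id = NORMALIZED_CROSSWALK.get(normalize_name(mbs_loan["seller"]))
    if filer_id is None:
        return ("none", [])
    block = hmda_index.get((filer_id, mbs_loan["state"], mbs_loan["year"]), [])
    candidates = [
        rec for rec in block
        if rec["rate"] == mbs_loan["rate"]
        and abs(rec["amount"] - mbs_loan["amount"]) <= amount_tol
    ]
    if len(candidates) == 1:
        return ("unique", candidates)
    return ("multi" if candidates else "none", candidates)
```

Without the filer in the block key, two lookalike 3% loans from different lenders in the same state and year would both be candidates; with it, each MBS loan is only compared against the records its own originator filed.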

So just how many mortgages have you linked, John? For loans currently in MBS pools (as of November), roughly 23 million of them, spanning FHA, VA, FNMA and FHLMC. I have linked many more that have since been removed from MBS pools (prepayment in some form). Some of the outstanding mortgages just can’t be linked, either because there is no HMDA record for them or because I cannot resolve a multi-match. It’s not everything, but it’s well more than enough to see a prepayment propensity signal in the HMDA part of the linked data.

That’s cool, John, and MBS traders/investors should be interested in that, but what about this “blows privacy and refinance origination wide open” stuff you mentioned?

The Forbidden Link

When mortgages are recorded, they become part of public records. They need to be, so that future purchasers of the subject property can understand who else might have a legal claim to the property. Pretty much every county makes these records available online. It is possible, albeit a truly monumental PITA, to stitch all these together into a national dataset of mortgage recordings. Fortunately, there are commercial entities that have already done that work, and for a bit of money (say $50-100k) you can have access to them.

The datapoints available in this dataset include precise property address, interest rate, precise loan amount, mortgagee (aka originator/HMDA filer), and precise origination date. Linking either MBS disclosure data or HMDA data to mortgage recording records independently would, for the most part, not work. HMDA is very vague on origination date (only granular to the year) and vague enough on loan amount (rounded up or down to the next 5000) that almost everything would multi-match. Similarly with the MBS disclosure data: while the origination date is fairly precise (either exact in GNMA or a highly probable origination month for the GSEs) and the loan amount is more precise (within 1000 instead of 5000), the best you can do with locale is state level. Perhaps you could get some solid matches in a low-origination year and a tiny-population state, but not enough to be useful. The linked HMDA and MBS datasets would crack this nut: the better specifics on date and amount from MBS, along with county and census tract from HMDA, would enable a relatively high level of matching. You would doubtless still have some multi-matchers, but enough would match to be potentially extremely useful.

There’s just one massive problem. Doing so would specifically violate GNMA’s, FNMA’s and FHLMC’s terms of data use. For example, GNMA’s terms

There are similar provisions in the GSEs’ terms of use. While it is possible, and would be fairly easy to do once you’ve linked the MBS and HMDA data, it is forbidden.

The privacy implications are sort of obvious. Most folks don’t want their neighbors to know when they have fallen behind on their mortgage or entered into a Loss Mitigation workout plan. They bark really loud, but with all the data that is out there about individuals, I’m skeptical that if this link were allowed it would actually change much if anything.

The practical implications for the mortgage refinance origination industry are massive, though. It totally upends the existing economics of mortgage refinance. Why’s that? Well, let me digress for a bit and explain the fundamentals of refinance mortgage origination. It basically comes down to this: you have to get the prospective borrower on the phone first, before anyone else. If you can do that with a borrower for whom it is economically advantageous to refinance, you have a good shot at getting the origination and the revenue that goes with it. That’s basically it. Yeah, yeah, you gotta be not completely disastrous with all the stuff that comes after first contact (underwriting, disclosures, capital markets, etc.), but you can mostly suck with that stuff, and if you are making first contact with homeowners who are in the money to refinance, you’ll do great. Mortgage originators know this and accordingly spend enormous amounts of money on data/leads (LendingTree etc.) and, to a lesser degree, advertising to try to be first.

The easiest way to be first is if you are servicing the loan that will be refinanced. Indeed, a key reason, often the key reason, mortgage originators retain the servicing on loans they originate is so they can originate the next loan for that borrower. Successful originators will evaluate every loan in their servicing book every single night to see if it is now advantageous for the borrower to refinance and the day it becomes so, immediately give that borrower a call to discuss how they can refinance and save money each month. They can do this (and no one else can) because they have all the pertinent information from their servicing data to feed a mortgage product and pricing engine to see when it becomes advantageous.
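The nightly scan a servicer runs over its own book might look roughly like the sketch below. A real shop feeds a full product and pricing engine; this stand-in only compares level payments at the current and market rates over the same remaining term. The closing cost figure, the breakeven threshold, and the loan field names are all hypothetical.

```python
def monthly_payment(balance, annual_rate_pct, months):
    """Level payment on a fully amortizing loan (standard annuity formula)."""
    r = annual_rate_pct / 100 / 12
    if r == 0:
        return balance / months
    return balance * r / (1 - (1 + r) ** -months)


def refi_candidates(book, market_rate_pct, closing_costs=4000.0, max_breakeven_months=36):
    """Flag loans whose payment savings repay assumed closing costs quickly enough.

    `closing_costs` and `max_breakeven_months` are illustrative assumptions.
    """
    flagged = []
    for loan in book:
        current = monthly_payment(loan["balance"], loan["rate_pct"], loan["months_left"])
        proposed = monthly_payment(loan["balance"], market_rate_pct, loan["months_left"])
        savings = current - proposed
        if savings > 0 and closing_costs / savings <= max_breakeven_months:
            flagged.append({"id": loan["id"], "monthly_savings": round(savings, 2)})
    return flagged


# Hypothetical two-loan servicing book.
book = [
    {"id": "A", "balance": 300000, "rate_pct": 7.5, "months_left": 340},
    {"id": "B", "balance": 300000, "rate_pct": 3.0, "months_left": 320},
]
tonight = refi_candidates(book, market_rate_pct=6.0)
```

Every input here (balance, rate, remaining term) is something the servicer already holds for each loan, which is exactly why the servicer gets to make the first call.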

Breaking refi economics: churning other folks’ servicing books

Having 20+ million loans linked across MBS, HMDA, and property records would give you most of the advantages of having the servicing book. The data isn’t quite as good as a servicer would have, but it would be good enough for mostly accurate pricing engine output and identification of refinance leads. You wouldn’t have to pay for leads, you wouldn’t have to pay to acquire the servicing rights to the loan, and the cherry on top is that you wouldn’t lose the economic value in the servicing right that gets destroyed when the loan is refinanced; that’s some other servicer’s loss. Simply put, you could churn other servicers’ servicing books. It would transform the economics of mortgage refinance.

I honestly, seriously considered not writing any of that and only writing about the legit uses of this data that don’t violate GNMA/FNMA/FHLMC terms of use. I imagine they won’t appreciate what I wrote, and I have zero desire to invoke their ire. On the other hand, I think it’s important for them to realize that the linking of HMDA and their MBS data is becoming more widespread and will eventually just be the norm, and that all that really stands in the way of the further link to property records is their policy prohibiting it. Considering some of the characters in the mortgage industry, that likely won’t be enough. GNMA and the GSEs should be thinking about ways to monitor for such activity.

The Legit and Valuable Use of this Data

There is, however, a completely legit use of the linked datasets: predicting the prepayment and default propensity of the loans within the MBS. Providing information to existing and potential MBS investors is the purpose of the MBS disclosure data to begin with. Linking in the additional datapoints in the HMDA data just lets you do it better. Why does better predictability of prepayment propensity matter to an MBS investor? Well, MBS investors typically pay a premium for MBS because its coupon is higher than they could get on a Treasury note/bond. Both assets are essentially credit risk free, but MBS carries significant prepayment risk. Almost all mortgages eventually prepay; it’s a very rare mortgage that goes the distance of its term. If that mortgage prepays quickly (and prevailing interest rates stay the same as they were when the mortgage was originated), the investor won’t have time to recoup their premium from the elevated coupon and will lose money. On the other hand, if the mortgage does not prepay for a long time (and prevailing interest rates stay the same), the investor earns the higher MBS coupon over what they could be earning in Treasuries for a longer time, recouping the premium paid and then some. Bottom line: assuming prevailing rates stay the same, MBS investors who paid a premium don’t want the mortgages underlying the MBS to prepay.
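A back-of-the-envelope version of that premium math, as a sketch only: the 103 price, 6.0 coupon, and 4.5 Treasury yield are made-up numbers, and the calculation ignores amortization, discounting, and reinvestment, all of which a real MBS model would include.

```python
def annual_excess_income(price, coupon_pct, treasury_yield_pct, face=100.0):
    """Yearly MBS coupon income minus what the same cash would earn in Treasuries."""
    mbs_income = face * coupon_pct / 100
    treasury_income = price * treasury_yield_pct / 100
    return mbs_income - treasury_income


def net_gain_if_prepaid(price, coupon_pct, treasury_yield_pct, years, face=100.0):
    """Excess income collected before a full prepayment at par, less the premium lost."""
    premium = price - face
    return annual_excess_income(price, coupon_pct, treasury_yield_pct, face) * years - premium


def breakeven_years(price, coupon_pct, treasury_yield_pct, face=100.0):
    """Years of excess income needed to earn back the premium paid."""
    return (price - face) / annual_excess_income(price, coupon_pct, treasury_yield_pct, face)


# Hypothetical premium MBS: pay 103 for a 6.0 coupon while Treasuries yield 4.5.
early = net_gain_if_prepaid(103, 6.0, 4.5, years=1)   # prepaid after one year
late = net_gain_if_prepaid(103, 6.0, 4.5, years=5)    # prepaid after five years
```

With these made-up inputs the investor is underwater if the pool prepays after one year and ahead if it lasts five, with the crossover a little past two years.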

On the other hand, sometimes MBS investors want the underlying mortgages to prepay. For example, if you were to buy today MBS with 2.5 coupons originated during the pandemic, you would like nothing more than for all of those mortgages to prepay. Why? Because instead of a premium you paid a significant discount, say 75 vs. 100 par, for that MBS, so if all the mortgages in the MBS quickly prepaid, you would receive 100 back on the 75 put in. For the most part, of course, those mortgages don’t prepay much, and they certainly don’t prepay due to a rate-advantaged refinance. They rarely (though it does happen) prepay due to a cashout refinance, but man, it’s painful for the borrower to give up that rate. They also occasionally prepay because, well, life (or death) happens. You get a divorce and have to sell the marital home, you move to take a new job and have to sell, you have a kid and want/need a bigger home in a different part of your locale, you die and your estate has to sell your home. There are plenty of scenarios where life > rate, the home is sold, the mortgage is paid off, and the MBS investor is delighted.
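Using the 75 price and 2.5 coupon from the example above, a rough sketch shows how much the timing of a payoff matters to the discount buyer. It assumes the whole position prepays at par at a single date and ignores amortization and coupon reinvestment; the function name is mine.

```python
def annualized_return_if_prepaid(price, coupon_pct, years, face=100.0):
    """Annualized return when the full face is repaid at par after `years`.

    Simplification: coupons are summed without reinvestment and the balance
    does not amortize before the payoff.
    """
    total_received = face + face * coupon_pct / 100 * years
    return (total_received / price) ** (1 / years) - 1


fast = annualized_return_if_prepaid(75, 2.5, years=1)    # payoff after one year
slow = annualized_return_if_prepaid(75, 2.5, years=10)   # payoff after ten years
```

Under these simplifications a one-year payoff returns roughly 37% annualized, while a ten-year payoff works out to roughly 5% a year, which is why a discount buyer cares so much about who is likely to sell or move.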

And that’s where the HMDA demographic data comes in. It turns out there’s definitely a signal in that data (race/ethnicity, sex, age), which is both interesting on a sociological level (I honestly had no idea what that signal would be, or if there would even be a signal, before running the queries to suss it out) and valuable on an MBS investing level. I’ll go into that in greater detail shortly.

But first, a commercial break. OK, it’s not actually a commercial break, more a “what do I intend to do with this dataset.” Since leaving my W-2 job in September, I have spent the majority of my time developing this linked dataset, including the code to parse and load all the MBS and HMDA data into databases and the algorithms/code to link them all together. I have pondered creating a company to sell this data and analysis from it, basically competing with the RecursionCos and IVolatilitys of the world. As lean as I run, I’m certain I could undercut on price and do well. Honestly though, I’d rather spend more of my time/efforts thinking on/solving the next interesting problem (and writing about it in this substack) than operationalizing this one. Thus I am interested in selling/licensing the code required to pull in the MBS/HMDA data and link it together, or potentially partnering with someone to operationalize it. My X DMs are open, so reach out to me there @johncomiskey77. Or become a paid subscriber to this substack and leave a comment or message me on this platform.

Now, back to the prepayment signal that arises from the HMDA data. This is just one example; there are likely more waiting to be discovered.

I start with the set of FHA loans that were outstanding in January 2023 and had interest rates of 5.75 or lower. I stick with the lower rates to exclude the possibility of a genuine rate refinance. So any of these loans that paid off since January 2023 did so through either a cashout refinance (albeit a painful one) or a straight payoff (almost certainly due to the sale of the home securing the mortgage, but I suppose a lottery winner might be in there as well). Delinquency and loss mitigation can also cause loans to be removed from MBS pools (an effective payoff for the MBS investor), but I filter those out. Next, I reduce the set to only those loans that I can confidently link to their HMDA record. Then I take the subset of those loans that paid off in the last 2 years through November (excluding Loss Mitigation and delinquency-based pool removals) and group it by demographic data to see if there is any significant difference in prepayment propensity. Specifically, I grouped by HMDA derived race, ethnicity, primary applicant age, and HMDA derived sex.
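The cohort logic just described can be sketched in a few lines of Python. The real thing runs as database queries; the field names, the removal reason labels, and the choice to drop delinquency and loss mitigation removals from the cohort entirely (rather than count them as non-payoffs) are assumptions of this sketch.

```python
from collections import defaultdict

# Hypothetical labels for pool removals that are not voluntary payoffs.
EXCLUDED_REMOVALS = {"delinquency", "loss_mitigation"}


def payoff_rate_by_group(loans, group_field, max_rate_pct=5.75):
    """Share of cohort loans that paid off, keyed by one demographic field.

    Each loan is a dict with hypothetical keys: program, rate_pct, linked,
    removal_reason (None if still outstanding), plus the grouping field.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [paid_off, total]
    for loan in loans:
        if loan["program"] != "FHA" or loan["rate_pct"] > max_rate_pct:
            continue  # outside the low-rate FHA cohort
        if not loan["linked"]:
            continue  # no confident HMDA link, so no demographics
        if loan["removal_reason"] in EXCLUDED_REMOVALS:
            continue  # filtered out of the cohort (assumption of this sketch)
        bucket = counts[loan[group_field]]
        bucket[1] += 1
        if loan["removal_reason"] == "payoff":
            bucket[0] += 1
    return {group: paid / total for group, (paid, total) in counts.items()}
```

Calling it once per field (derived race, ethnicity, applicant age range, derived sex) gives one payoff share per group that can then be compared across groups.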

Below are the somewhat surprising (at least to me) results.

Reverse Engineering Finance

