Implementing OpenRoaming with WBA

Quick version for anyone who isn’t a Wireless Network Engineer:

OpenRoaming is roaming for Wi-Fi, the way your cellphone already roams. You know how your cellphone hops between cell towers without you logging in each time? OpenRoaming does that for Wi-Fi. Once your device is enrolled, it walks into any participating venue (a coffee shop, an airport, my house) and connects automatically and securely: no captive portal, no password, no “Accept the terms” page. It just works.

There are a few flavours of this. eduroam for education (the name gives it away), govroam for government (100 points if you guessed it), and the one I’m testing this week: OpenRoaming from the WBA (Wireless Broadband Alliance), an ever-expanding club of networks and identity providers that agree to trust each other. Join the club, and your users can roam onto any member network securely.

To make your network WBA-ready, you contact the WBA and they hand you a set of certificates and keys (after giving some money). You build those into a “key-chain” on your gear, and from then on the happy WBA users can authenticate and roam through your Wi-Fi. Bob's your uncle.

My Story:

So OpenRoaming is a beautiful thing when it works. The client associates and is online before the phone’s out of your pocket. But when I tried it in my home lab, it didn’t work: the client associated, sat there, and never authenticated, with roughly zero useful error anywhere. My phone just said “No Connection” against my ‘perfectly’ configured WBA SSID.

I like to think of RadSec (Secure RADIUS) as a phone call to the WBA federation. You dial, and you get one of three things: a wrong number, a busy signal, or someone who picks up but doesn’t speak your language. Listening in from across the room, though, all three sound the same: it’s just “nobody answered.” The whole job is figuring out which one you actually got.

Here’s how I worked it on an ExtremeCloud IQ Controller (XCC), one layer at a time.

The symptom:

The client associates to the OpenRoaming SSID, then stalls. On the AP, show station tells the whole sad story:

State = authenticating PMK = 0000* PTK = 0000* IP = 0.0.0.0

No master key, no handshake, no DHCP. 802.1X starts and dies. That’s all the GUI and CLI give you, which is exactly why you reach for a capture.
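If you want to spot this state across many stations, here is a minimal sketch that parses a line like the one above. It assumes the "Key = value" layout quoted in this post; the function names are mine, and the "healthy" sample line in the usage note is made up for contrast, not real XCC output.

```python
import re

def parse_station_line(line):
    """Split 'Key = value Key = value ...' into a dict of strings."""
    return dict(re.findall(r"(\w+)\s*=\s*(\S+)", line))

def is_all_zero_key(value):
    """True for masked keys such as '0000*' (or a missing field)."""
    return set(value.rstrip("*")) <= {"0"}

def stalled_in_dot1x(fields):
    """Associated, but no PMK, no PTK and no IP: 802.1X started and died."""
    return (
        fields.get("State") == "authenticating"
        and is_all_zero_key(fields.get("PMK", ""))
        and is_all_zero_key(fields.get("PTK", ""))
        and fields.get("IP") == "0.0.0.0"
    )

sample = "State = authenticating PMK = 0000* PTK = 0000* IP = 0.0.0.0"
print(stalled_in_dot1x(parse_station_line(sample)))
```

A line with non-zero keys and a real address (for example a hypothetical "State = authenticated PMK = 9f3a* PTK = 77c1* IP = 192.168.1.20") would not be flagged.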

Layer 1: the air — where does EAP give up?

Grab an over-the-air capture and ask one question: does the client even get offered a way to log in (an “EAP method”)? In my trace, every single frame was the same:

12 × EAP-Request / Identity
12 × EAP-Response / Identity

So EAP frozen at Identity means the authentication server never answered. The break is on the wired side. The client and the radio are fine, so stop looking at the air.
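The same question can be asked of a list of decoded frames. This is a rough sketch, assuming you have already exported each EAP frame as a (code, type) pair from your capture tool; the export step and the verdict labels are my own invention. The numeric codes and types are the standard ones from RFC 3748 (Identity is type 1).

```python
# EAP codes and types as defined in RFC 3748.
EAP_REQUEST, EAP_RESPONSE, EAP_SUCCESS, EAP_FAILURE = 1, 2, 3, 4
TYPE_IDENTITY, TYPE_NOTIFICATION = 1, 2

def eap_verdict(frames):
    """Classify a capture from (code, eap_type) tuples.

    eap_type is None for Success/Failure, which carry no type field.
    """
    frames = list(frames)
    if not frames:
        return "no-eap"             # 802.1X never started: look at association
    codes = {code for code, _ in frames}
    if EAP_SUCCESS in codes:
        return "success"
    types = {t for code, t in frames if code in (EAP_REQUEST, EAP_RESPONSE)}
    if types - {TYPE_IDENTITY, TYPE_NOTIFICATION}:
        return "method-offered"     # a server is answering; look higher up
    if EAP_FAILURE in codes:
        return "failed-before-method"
    return "stuck-at-identity"      # nobody upstream answered the Identity

# My trace: twelve Request/Identity and twelve Response/Identity frames.
my_trace = [(EAP_REQUEST, TYPE_IDENTITY), (EAP_RESPONSE, TYPE_IDENTITY)] * 12
print(eap_verdict(my_trace))
```

Anything other than "stuck-at-identity" means the conversation reached a server, and the wired-side hunt below is not where your problem lives.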

Layer 2: the transport — is RadSec even connecting?

RadSec is RADIUS wrapped in TLS on TCP port 2083, so capture there. And here’s the mental model that saves you hours:

TLS problems and RADIUS problems live on different floors. “Unknown CA” is a TLS thing. Realm routing is a RADIUS thing, and RADIUS only happens after TLS finishes shaking hands. You cannot have a realm problem on a handshake that never completed.

My first wired captures showed nothing at all on 2083. The controller never opened a socket. Why? The RadSec server entry had no destination IP to dial. On the cloud side it had been auto-generated as 0.0.0.0, read-only and un-fixable, which is why I was on the controller in the first place. No target, no connection, no answer. One floor up, that’s your Identity loop.
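Before reaching for a capture, a quick reachability probe can tell a wrong number from a busy signal. This is a minimal sketch using only Python's standard socket module; the function name and the result labels are mine. It tests from wherever you run it, which is not necessarily the path the controller takes, and it says nothing about TLS.

```python
import socket

RADSEC_PORT = 2083  # RADIUS over TLS

def probe_tcp(host, port=RADSEC_PORT, timeout=3.0):
    """Return a coarse label for what happens when we dial host:port."""
    if host in ("", "0.0.0.0"):
        return "no-destination"      # nothing to dial, as in my cloud config
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"            # TCP answers; TLS is the next floor up
    except ConnectionRefusedError:
        return "refused"             # host is there, nothing listens on the port
    except socket.timeout:
        return "timeout"             # filtered, or the wrong address entirely
    except OSError as exc:
        return f"error: {exc}"

print(probe_tcp("0.0.0.0"))
```

"open" only proves that something accepts TCP on that port. Whether it is the right something comes later.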

The certificate trap (a.k.a. the part everyone gets wrong)

When RadSec misbehaves, the entire industry (also me) points at the certificates. I verified mine four separate times. They were correct every time:

Does the key actually match the cert? (the two hashes must be identical)

openssl pkey -in key.pem -pubout -outform DER | openssl sha256
openssl x509 -in cert.pem -noout -pubkey | openssl pkey -pubin -outform DER | openssl sha256

Does the chain verify to the OpenRoaming root?

openssl verify -CAfile chain.pem cert.pem

My chain was complete: my cert was signed by WBA’s CA-5, which was signed by the OpenRoaming Policy CA, which in turn was signed by the self-signed openroaming.org root. An unbroken ladder up to the anchor everyone in the federation trusts, and nothing expired. The PKI was never the bug. Write that on your hand, because when RadSec breaks your first instinct will be to blame the certificates, and you’ll waste an afternoon.

A certificate doesn’t vouch for itself. It’s signed by the one above it, all the way up to a root everyone already trusts. “Complete and clean” just means every rung is present, correctly signed, and nothing’s expired, so the climb succeeds. If yours does that, the certs are fine. Look somewhere else.
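To make the ladder idea concrete, here is a toy model of the climb. It is a sketch only: it compares names and expiry dates and does not verify a single signature, which is what openssl verify does for real. The Cert type, the climb function and the "my-anp-cert" label are hypothetical; the other three names are shorthand for the rungs described above, not the actual certificate subjects.

```python
from datetime import datetime, timezone
from typing import NamedTuple

class Cert(NamedTuple):
    subject: str
    issuer: str
    not_after: datetime

def climb(leaf_subject, certs, now=None):
    """Walk issuer links from the leaf to a self-signed root.

    Returns the list of subjects visited, or raises ValueError naming
    the rung that is missing, expired, or part of a loop.
    """
    now = now or datetime.now(timezone.utc)
    by_subject = {c.subject: c for c in certs}
    path, current = [], leaf_subject
    while True:
        cert = by_subject.get(current)
        if cert is None:
            raise ValueError(f"missing rung: {current}")
        if cert.not_after <= now:
            raise ValueError(f"expired rung: {current}")
        if current in path:
            raise ValueError(f"loop at: {current}")
        path.append(current)
        if cert.issuer == cert.subject:   # self-signed: this is the anchor
            return path
        current = cert.issuer

far_future = datetime(2999, 1, 1, tzinfo=timezone.utc)  # placeholder expiry
chain = [
    Cert("my-anp-cert", "WBA CA-5", far_future),
    Cert("WBA CA-5", "OpenRoaming Policy CA", far_future),
    Cert("OpenRoaming Policy CA", "openroaming.org root", far_future),
    Cert("openroaming.org root", "openroaming.org root", far_future),
]
print(" -> ".join(climb("my-anp-cert", chain)))
```

Remove any one entry from chain and the climb fails with the name of the missing rung, which is the "incomplete chain" case in miniature.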

The XCC certificate bundle dialog. Three slots: CA, server cert, key. Get the right files into the right slots and the chain above is what you’ve built.

Layer 3: are you even talking to the right RadSec server?

Once I pinned a static RadSec target and the controller finally connected, TLS still failed: a fatal “Unknown CA,” sent by my own controller. One command explained the entire mystery:

openssl s_client -connect <hub-ip>:2083 </dev/null 2>/dev/null | openssl x509 -noout -issuer -subject

The money shot. My controller looked at the hub’s certificate and said “I don’t know who signed this,” and it was right.

In my case, the server replied with this identity:
issuer= CN=QA ID Manager CA, O=Aerohive Networks
subject= CN=authgateway, OU=Engineering

My “production hub” was an internal QA box. My controller did exactly the right thing and refused to trust a server cert from a CA that has nothing to do with OpenRoaming. The Unknown CA was the truth, not a bug. Yes, I’d been beating up my certificates over a server that was never WBA to begin with. Just testing (ahum).

Point it at the actual federation hub and the same command gives you what you want:
issuer= ...WBA WRIX ECC Intermediate CA-5...

If I could tattoo one command onto every RadSec/Wireless engineer, it would be that s_client issuer check. Before you debug anything else, confirm the thing on the far end of 2083 is a real WBA hub. I lost more time to a wrong IP than to any actual misconfiguration, and I’m not proud of it.

Left: a box wearing the wrong badge. Right: an actual WBA hub. Same command, two very different afternoons.
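If you check more than one hub, the eyeballing can be scripted. This is a small sketch that reads the text printed by the s_client command above and looks for the issuer fragment from my output. The function names are hypothetical, and the default fragment is an assumption taken from this post: your hub may be issued by a different WBA intermediate. A substring match is a sanity check, not trust validation.

```python
def parse_identity(text):
    """Pull the issuer= and subject= lines out of openssl x509 output."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() in ("issuer", "subject"):
            fields[key.strip()] = value.strip()
    return fields

def looks_like_wba_hub(text, issuer_fragment="WBA WRIX ECC Intermediate CA-5"):
    """True when the issuer line contains the expected fragment."""
    return issuer_fragment in parse_identity(text).get("issuer", "")

# What my "production hub" actually answered with.
qa_box = (
    "issuer= CN=QA ID Manager CA, O=Aerohive Networks\n"
    "subject= CN=authgateway, OU=Engineering\n"
)
print(parse_identity(qa_box)["issuer"])
print(looks_like_wba_hub(qa_box))
```

Feed it the output of the command, for example by piping the command into a script that passes sys.stdin.read() to looks_like_wba_hub.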

So, what was the root cause?

Honestly? I’d been at it for several days, so I’m not going to hand you a tidy single root cause. I think the journey is more useful in case you hit this yourself. Across the whole saga I’d changed (and learned) a bunch of things:

The destination hub, the client profile, the WBAID, the management platform.

I believe the setup was actually working fine once I switched to a different (correct) OpenRoaming client on my phone and made sure I was using the correct certs with the recommended settings.

That’s the real trap. When everything is broken at once, I start changing all sorts of variables and hoping it works this time. That tells you nothing, because you changed five things and have no idea which one mattered.

Change one variable at a time, or you learn nothing. The method is the part that actually holds up, whichever knob turned out to be last:

Air / EAP. Does EAP advance past Identity? No means the server isn’t answering.
TLS / 2083. Does the handshake complete? No means wrong or missing destination, or Unknown CA.
Server identity. Run the s_client issuer check. Is it actually a WBA hub or something else?
RADIUS, realm & Operator-Name. Does the federation recognise your ANP and route the realm? That’s a broker question, not a checkbox.
Client. Does the device hold a real, routable OpenRoaming credential?
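The five checks above can be sketched as an ordered list that always tells you which floor to look at next. This is only an illustration: the observation keys and the next_step function are names I made up, and you still have to fill in each answer from a real capture or command.

```python
# (observation key, layer name, what a "no" means), in the order to work them.
CHECKS = [
    ("eap_past_identity", "Air / EAP", "the server isn't answering"),
    ("tls_handshake_completes", "TLS / 2083",
     "wrong or missing destination, or Unknown CA"),
    ("peer_is_wba_hub", "Server identity", "the far end is not a WBA hub"),
    ("realm_routed", "RADIUS, realm & Operator-Name",
     "the federation isn't routing your realm; ask the broker"),
    ("client_credential_ok", "Client",
     "the device lacks a real, routable OpenRoaming credential"),
]

def next_step(observations):
    """Return the first layer that fails or has not been checked yet."""
    for key, layer, meaning in CHECKS:
        result = observations.get(key)
        if result is None:
            return f"{layer}: not checked yet, capture here next"
        if not result:
            return f"{layer}: failing, {meaning}"
    return "all five layers pass"

# Where I was after the over-the-air capture.
print(next_step({"eap_past_identity": False}))
```

Because the function stops at the first failing or unknown layer, it cannot be talked into debugging realms while TLS is still down.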

Work them in order and let each capture drop you exactly one floor. The loudest suspect, the certificates, was innocent the whole time. Where I actually went wrong was assuming the WBA client already on the device was a good one. Lesson: always retest with a client you know works before you blame the network.

Now the phone call connects fine. The trick is knowing whether you dialed the right number, reached the right desk, and got someone who speaks your language. And should you happen to walk by my house, you’ve probably been on my OpenRoaming network for a few minutes.
