There's a certain irony in building infrastructure for AI-native teams and then realizing your own website is invisible to AI crawlers. We ran headfirst into that realization over the last two weeks, and the journey from "maybe worth taking a fresh look at Firecrawl" to a fully verified, AI-ready datum.net turned out to be more interesting than I expected.
Here's what happened, especially the parts where we got it wrong.
The nudge
It started with a casual message from Jacob: "maybe worth taking a fresh look at Firecrawl." This is a very Jacob thing to drop in Slack: not quite a full project, definitely not a top-of-the-list sprint item, just a nudge that something might be worth revisiting.
For context: Firecrawl is a tool that converts websites into clean markdown. Much like Google's crawler reads the web for search, LLMs, AI search engines, and agents crawl sites to figure out what they're about, and they strongly prefer clean markdown over a tangle of raw HTML. If you've ever asked Perplexity a question and wondered how it decided what to cite, Firecrawl-style crawling is part of the answer.
I’d run a free-tier crawl on April 17th, so that’s where I started.
What the first audit said (and what we should have verified)
The initial results looked concerning. According to the crawl:
- /features/, /essentials/, /locations/, and /pricing/ were all returning near-zero content
- ~73% of our URLs had no meta descriptions
- The blog index wasn't rendering any posts
We turned this into a brief for Claude Code and filed five GitHub issues on April 20th: blog SSR (#1223), meta descriptions (#1224), JSON-LD structured data (#1225), llms.txt cleanup (#1226), and nav deduplication (#1222).
Here's where it gets embarrassing.
We try our best to apply the "verify before asserting" principle whenever we're working with AI. In this case, I shouldn't have assumed the crawl results were ground truth. It wasn't until after I'd opened all of the "please fix it" issues that we circled back to check.
Turns out, when we fetched the pages directly (live curl and web_fetch, not crawl outputs), every single page flagged as "not SSR'd" turned out to be fully server-rendered. The Firecrawl free tier just couldn't execute the JavaScript to confirm it.
Meta descriptions on blog posts, handbook pages, and events pages? Yup, all present and accounted for. Happily, issue #1224 was closed as Done without a single change needed.
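The check itself is almost embarrassingly simple. Here's a minimal sketch in plain TypeScript (the URLs and markers are illustrative, not our actual audit script): fetch the raw HTML the way a non-rendering crawler would, and look for what the crawl claimed was missing.

```ts
// Minimal sketch of the spot-check, not our actual audit script.
// fetch() returns raw HTML with no JavaScript execution, which is
// exactly what a non-rendering crawler sees.
const pages = [
  "https://www.datum.net/pricing/",
  "https://www.datum.net/features/",
];

for (const url of pages) {
  const html = await (await fetch(url)).text();
  console.log(url, {
    hasMetaDescription: html.includes('name="description"'),
    hasRealContent: html.length > 20_000, // crude "near-zero content" heuristic
  });
}
```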
In fact, our scariest findings, the "not SSR'd" pages and the missing meta descriptions, evaporated on contact with reality.
To give credit where it is due, the free-tier limitation is documented: Firecrawl is upfront that the free plan doesn't execute JavaScript. We just didn't account for what that meant when interpreting "missing content" findings. Lesson noted!
What was actually broken
Three issues were genuine, plus the nav problem, which gets a section of its own below.
The blog index (#1223) really wasn't server-rendering its post listings. We confirmed it by running both tiers back to back: the free tier returned zero posts, the paid tier returned the full listing. The fetch from Strapi (the headless CMS we use to make publishing blog posts more enjoyable across the team) was happening client-side. We fixed it by moving the fetch to build time, meaning posts are now baked into the static HTML. Verified in the April 22nd recrawl. 🦞
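For the curious, the shape of the fix looks roughly like this in an Astro page. A hedged sketch, not our actual code: the Strapi endpoint and field names are illustrative. What matters is that the fetch sits in the frontmatter, which runs at build time for static output, so the post list ships as plain HTML.

```astro
---
// Sketch only: the endpoint and field names are illustrative, not our setup.
// Frontmatter runs at build time for static pages, so this fetch happens
// once per deploy and the results are baked into the HTML.
const res = await fetch(`${import.meta.env.STRAPI_URL}/api/articles`);
const { data: posts } = await res.json();
---
<ul>
  {posts.map((post) => (
    <li>
      <a href={`/blog/${post.attributes.slug}/`}>{post.attributes.title}</a>
    </li>
  ))}
</ul>
```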
JSON-LD structured data (#1225) was genuinely absent. No Organization schema, no BlogPosting, no BreadcrumbList: the structured data AI search systems use to understand entity type and content relationships. Without it, our website was treated as an unknown entity. So we added all three, pulling from Strapi frontmatter at build time for blog posts. Confirmed live. ✅
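The shape of that addition, sketched with illustrative values rather than our exact schema: Astro's set:html directive drops the serialized JSON straight into the page without escaping it.

```astro
---
// Illustrative Organization schema; BlogPosting and BreadcrumbList follow
// the same pattern, populated from Strapi frontmatter at build time.
const organization = {
  "@context": "https://schema.org",
  "@type": "Organization",
  name: "Datum",
  url: "https://www.datum.net/",
};
---
<script type="application/ld+json" set:html={JSON.stringify(organization)} />
```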
llms.txt (#1226) had two entries with the literal string "No description available" — one for the blog index and one for the brand/social page. It’s a small thing, but that string would have appeared verbatim in AI context windows, which ain’t good. Fixed. 🔥
The Astro tax
Both of our real issues (and the nav problem we'll get to in a minute) trace back to the same root cause: the Astro framework’s partial hydration model. It’s powerful, but it puts the decision of when content exists in the HTML entirely in your hands.
Astro calls this the "islands architecture." The idea is that most of your page is static HTML rendered at build time, and only the interactive bits (e.g. a dropdown menu, a search box, a dynamic component) get shipped as JavaScript islands that hydrate in the browser. It's a genuinely good model that results in fast pages, less JavaScript, and better performance.
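In code, the model looks something like this (a sketch, with a hypothetical SearchBox standing in for any interactive framework component):

```astro
---
// SearchBox is a hypothetical interactive component (e.g. React),
// used purely for illustration.
import SearchBox from "../components/SearchBox";
---
<!-- Rendered to static HTML at build time: every crawler sees this. -->
<h1>Docs</h1>
<p>This content ships as plain HTML with zero JavaScript.</p>

<!-- An island: only this component ships JS and hydrates in the browser. -->
<SearchBox client:load />
```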
The catch is that scrapers and AI crawlers don't behave like browsers. A browser will wait for your JavaScript to execute and hydrate the component before a user sees anything. A crawler — whether it's Googlebot's first pass, Firecrawl's free tier, or an AI agent fetching your page — may read the raw HTML and stop there. If your content lives inside an island that only hydrates on client interaction, to that crawler the content simply doesn't exist.
Our blog index had drifted into exactly this trap. At some point the post listing became a client-side island fetching from Strapi. That was probably a reasonable choice at the time, given the dynamic nature of the content, but from a crawler's perspective, /blog/ was a page with a headline and nothing else. No posts, no titles, no excerpts. Yikes!
The fix wasn't "Astro is wrong." Astro has exactly the right primitive for this: fetch at build time, render static HTML, no island needed for content that doesn't change between deploys. The issue was that we'd made the wrong hydration choice for a component where the content needed to be crawler-visible.
If you're running Astro and you care about AI search visibility, the question to ask for every dynamic component is: does a crawler need to read this content? If yes, it belongs in the static build output, not behind a hydration boundary. That's not really a limitation of the framework, but something you need to think about when choosing partial hydration.
The fix that wasn't a fix
The nav duplication issue (#1222) is where this lesson really landed.
Every page on our website was emitting its full navigation twice: once for the desktop layout, once for a mobile variant. Both ended up in the static HTML. On the homepage, that meant roughly 70% of what a crawler saw was nav boilerplate. The actual content — our mission, product descriptions, and all the other good stuff that matters — was the remaining 30%.
The first fix felt clean: PR #1239 moved the mobile nav into a <template> element. In the browser DOM, <template> content is completely inert: it doesn't render, doesn't execute, and is invisible. We called it done.
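Simplified, the fix looked like this (a sketch, not the actual PR):

```astro
<!-- Sketch of the first attempt, simplified. In the browser DOM,
     <template> content is inert: nothing here renders or executes. -->
<template id="mobile-nav">
  <nav>
    <a href="/features/">Features</a>
    <a href="/pricing/">Pricing</a>
  </nav>
</template>
```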
Then we ran Firecrawl again on April 23rd.
Still two nav blocks. Still 70% noise.
The <template> fix worked perfectly for browsers. For Firecrawl (and for Googlebot's static HTML pass, and for any tool that reads raw HTML without executing JavaScript), <template> content is just text that's physically present in the source bytes. They read it anyway.
This is the same underlying issue as the blog index, just inverted. With the blog index, content we wanted crawlers to see wasn't in the HTML. With the mobile nav, content we didn't want crawlers to see was in the HTML. In both cases, the assumption was "the browser does the right thing, so we're fine" — and in both cases, the crawler doesn't know or care what the browser would have done.
So we filed #1242, superseding the original, and converted the mobile nav to client:only in Astro — meaning the markup is never emitted into the server-rendered HTML at all. It only exists after client-side hydration, so the crawler never sees it.
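The change amounts to something like this (a hedged sketch: MobileNav stands in for our actual component, and client:only requires naming the island's UI framework, with "react" here purely as an illustration):

```astro
---
// MobileNav stands in for our actual component. client:only requires
// naming the island's framework ("react" here is illustrative).
import MobileNav from "../components/MobileNav";
---
<!-- client:only skips server rendering entirely: no nav markup in the
     HTML payload; the component exists only after client-side hydration. -->
<MobileNav client:only="react" />
```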
After that fix:
```sh
curl -sL https://www.datum.net/ | grep -c "datum.net/locations/"
# 1
```

One nav, as expected. Done, and actually “agent ready” this time.
What the whole thing cost
Thanks to Jacob's nudge, driven by his obvious background in early-2000s SEO, we created five issues. Four of those we closed out with genuine fixes, and one was a false positive.
The effort occupied a good chunk of time on our tiny web team from April 17th to April 23rd — six days from first crawl to everything verified green. Now we have to keep it that way!
The paid Firecrawl tier was worth it for one reason: it let us confirm which findings from the free tier were real vs. artifacts. Without the comparison, we'd have spent time fixing SSR on pages that didn't need it.
Why any of this matters
If an AI agent asks "what does Datum Cloud do?" and the answer it gets is 70% nav links and 30% actual content, it's going to give a worse answer than if those ratios were flipped. Same for Perplexity deciding whether to cite us, or Google's AI Overviews summarizing what datum.net is.
We're building infrastructure for AI-native teams. The least we can do is make sure our own site reads cleanly to an AI.
The broader takeaway isn't really about Firecrawl, or Astro, or any specific tool. It's that "works in the browser" and "visible to crawlers" are two different contracts, and it's easy to conflate them, especially when your framework is designed to make the browser experience seamless. The partial hydration that makes Astro fast is the same thing that can make your content disappear from crawlers if you're not deliberate about it.
Verify the thing, in the context it actually matters.
We used Firecrawl for the crawls and Claude Code to drive the fixes. The issues are public in datum-cloud/datum.net if you want to see the full audit trail.
