Embeddings as Ontology: What Protein Language Models Might Unlock
Biology has always wanted a clean table of contents. We name protein families, define domains, draw pathways, and attach terms like “kinase activity” or “membrane component” as if the living world were a library that just needed better shelving. Those efforts matter. They also inevitably show their seams: the same protein can play different roles in different contexts, “families” blur at the edges, and new sequences arrive faster than curation can keep up.
In the last decade, machine learning offered a different kind of organizing principle: not a taxonomy with discrete bins, but a geometry where items live in a continuous space. Word embeddings made the point feel obvious in retrospect. Once you can represent words as vectors, “meaning” becomes something you can navigate: nearest neighbors, clusters, analogies, and smooth transitions instead of hard boundaries. CLIP did something similar for images, turning a chaotic universe of pixels into a space where “a wooden chair” is not just a label but a region you can move toward or away from.
Protein language model embeddings are starting to look like the same move—applied to molecules. The claim is not that embeddings replace biology, or that “proteins are language” is literally true. It’s that a learned vector space can act as an ontology tool: a pragmatic coordinate system for organizing, searching, and reasoning about proteins at scale. This post is about the how and why of that possibility—and the caveats that keep it honest.
What “ontology tool” means in practice
When people say “ontology,” they often mean a formal vocabulary: a set of terms plus relations (“is-a,” “part-of,” “regulates”) that makes knowledge computable. That’s the Gene Ontology mindset: precise identifiers, curated structure, explicit semantics. It’s powerful, and it’s brittle in the way all explicit schemas are brittle: reality keeps producing cases that don’t fit neatly, and humans have to keep patching the map.
An embedding space is almost the opposite. It is not explicit. It has no names built in. Its semantics are implicit: a point is “close” to other points because the model learned that closeness is useful for the training objective. The ontology emerges as geometry.
So “embeddings as ontology tools” doesn’t mean replacing curated ontologies with inscrutable vectors. It means using embeddings to do the things ontologies are often used for—organize, retrieve, relate, generalize—but in a way that can flex with the data. You keep the curated labels as anchors. You gain a continuous landscape between them.
The analogy: word2vec and the surprise of usable geometry
Word2vec didn’t succeed because it understood language like a human. It succeeded because a simple prediction task forced the model to compress distributional regularities into a low-dimensional space. The result was a surprisingly useful geometry: synonyms cluster, topics form regions, and directions sometimes correspond to interpretable “axes” (tense, gender, geography, formality).
What changed wasn’t just accuracy on some benchmark. What changed was the workflow. Once words had coordinates, a huge family of operations became natural: semantic search, recommendation, deduplication, analogy exploration, and visualization. You could build tools that felt like “browsing meaning.”
CLIP extended the idea: learn a shared space where images and text can meet. That unlocked not only better retrieval (“show me images like this caption”) but a coherent, navigable ontology for images. The space had structure you could exploit even when labels were missing or noisy.
The key pattern is this: a representation becomes an ontology tool when it supports composable, general-purpose operations that are stable enough to build on. Protein embeddings are interesting because proteins are one of the rare biological objects with both massive unlabeled data (sequences) and strong, consistent constraints (evolution, structure, biophysics).
Why proteins are unusually “embed-able”
A protein sequence is not a random string. Every residue sits inside a web of constraints: it must fold, it must interact, it must survive selection pressures, it must be expressible, it must be compatible with cellular context. Evolution edits sequences by small moves, and the viable moves are biased toward preserving function and fold. That means the space of natural proteins has shape—a manifold carved out of the astronomically large space of all possible sequences.
Language models are, in a sense, manifold learners. Train a model to predict missing tokens in a sequence, and it is incentivized to internalize whatever regularities make prediction easier: motifs, domains, long-range dependencies, and the statistical signatures of structural constraints. In proteins, those regularities are not just “grammar”; they are compressed reflections of biochemistry.
This is the deep reason protein embeddings might become ontology tools: the training objective is a proxy for the same hidden variables biology cares about—fold, function, interaction partners, localization, evolutionary history—because those hidden variables shape which sequences exist in the first place.
A useful metaphor, not a literal claim
“Proteins are language” is easy to over-literalize. The real point is that both language and proteins produce huge corpora of discrete sequences generated by constrained processes. A model that learns to predict tokens can pick up on those constraints, and its internal states can become useful coordinates.
How an embedding becomes a protein “coordinate system”
Protein language models output different kinds of representations. Some are per-residue (a vector for each position in the sequence), which can highlight motifs or structural features. Some are pooled into a single vector for the whole protein—an embedding meant to summarize the sequence. Either can act as coordinates; the “whole-protein” vector is the simplest place to see the ontology-tool idea.
Once you have a vector for each protein, you can build operations that feel ontology-like:
- Similarity search: “Find proteins near this one.”
- Neighborhood labeling: “What annotations do nearby proteins share?”
- Clustering: “Partition the space into families, subfamilies, and outliers.”
- Interpolation: “What lives between these two functional regions?”
- Visualization: “Project the space and browse it like a map.”
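To make the “coordinates” concrete, here is a minimal sketch of the pooling step. The `embed_per_residue` function is a hypothetical placeholder standing in for a real protein language model call; any model that returns one vector per residue would slot in the same way.

```python
import numpy as np

def embed_per_residue(sequence: str) -> np.ndarray:
    """Hypothetical stand-in for a protein language model call.
    Any model that returns one d-dimensional vector per residue
    (shape (L, d)) slots in here."""
    rng = np.random.default_rng(sum(map(ord, sequence)))
    return rng.normal(size=(len(sequence), 128))

def embed_protein(sequence: str) -> np.ndarray:
    """Mean-pool per-residue vectors into one whole-protein vector."""
    per_residue = embed_per_residue(sequence)  # (L, d)
    return per_residue.mean(axis=0)            # (d,)

vec = embed_protein("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(vec.shape)  # (128,)
```

Everything in the list above (search, clustering, visualization) operates on these pooled vectors.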
The first time you see this working, it feels like cheating. You take a poorly annotated protein, drop it into the space, and its neighbors are suddenly informative: remote homologs, proteins with similar domain architecture, or enzymes that act on related substrates. Even when you already have alignment tools, the workflow is different: embeddings give you a fast, global, continuous notion of proximity.
What embeddings can capture that alignments struggle with
Sequence alignment is one of the great inventions of computational biology. But it has a particular bias: it privileges detectable residue-level correspondence. When two proteins are close enough that alignment sees them, life is good. When they’re far—remote homology, heavy divergence, domain shuffling, low-complexity regions—alignment becomes fragile or slow, and the “search radius” shrinks.
Embeddings offer a different bias. They can, at least in principle, represent proteins by distributed features that do not require clean residue-to-residue matches. If two proteins share a fold or functional motif that is expressed in multiple sequence patterns, an embedding space can place them closer even when a simple alignment is uncertain. It’s not magic; it’s compression. The model has learned a set of features that are useful for predicting residues across the corpus, and those features can sometimes function like a learned notion of “biological similarity.”
This matters for ontology-building because curated categories often rely on those deeper similarities. We don’t want “family” to mean “aligns well.” We want it to mean “shares an evolutionary/structural/functional identity” even when the surface evidence is messy.
From coordinates to ontology: labeling the space
A vector space by itself is mute. The reason word2vec felt meaningful is that we could poke it: nearest neighbors returned words we recognized, and analogies lined up with human concepts. For proteins, the equivalent is attaching biological meaning to regions of the space.
There are two complementary moves.
1) Use curated labels as anchors
Suppose you have a set of proteins with reliable annotations: domains, enzyme classes, subcellular localization, binding partners, phenotypes. Place them in embedding space and look at the geometry. Do proteins sharing a label cluster? Are there subclusters that suggest finer distinctions than the label currently encodes? Are there boundary regions where labels conflict, hinting at multifunctional proteins or annotation noise?
This is ontology work, but with a new instrument: you’re not just curating terms; you’re curating where those terms live in a continuous landscape.
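One concrete way to “curate where terms live” is to score neighborhood purity against the anchors. A minimal sketch, assuming `embeddings` and `labels` are precomputed NumPy arrays; the function name is mine:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighborhood_purity(embeddings: np.ndarray, labels: np.ndarray, k: int = 10) -> float:
    """Average fraction of each protein's k nearest neighbors that
    share its curated label; high purity means the label clusters."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, idx = nn.kneighbors(embeddings)       # column 0 is the point itself
    neighbor_labels = labels[idx[:, 1:]]     # (n, k) labels of true neighbors
    return float((neighbor_labels == labels[:, None]).mean())
```

Low purity for a label is itself informative: either the label is heterogeneous or the space doesn't represent it.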
2) Let the space propose structure you didn’t name yet
The second move is more exciting and more dangerous: use the embedding geometry to suggest new groupings, new relationships, and new hypotheses. Clusters can propose families. Bridges between clusters can suggest evolutionary transitions or convergent solutions. Outliers can highlight novel biology or broken sequences.
In practice, the best pattern is cyclical: embeddings propose structure → humans and experiments validate → validated structure becomes new anchors → the map improves. That’s how an implicit representation becomes a usable ontology tool without pretending to be a formal ontology.
Can protein embeddings do “analogy math” like words?
The seductive demo for word embeddings is the “king − man + woman ≈ queen” trope. Biology invites similar temptations: enzyme for substrate A minus substrate A plus substrate B equals enzyme for substrate B. Sometimes, in carefully curated settings, you can find directions in protein embedding space that correlate with properties: soluble ↔ membrane, bacterial ↔ eukaryotic, secreted ↔ cytosolic, and so on.
But proteins are trickier than words. Many properties are entangled: localization correlates with signal peptides, which correlate with domain composition, which correlates with organism, which correlates with codon usage, and so forth. The geometry is real, but “axes” are rarely clean.
The right mental model is not “vector arithmetic yields perfect semantic edits.” It’s “directions can be useful features,” especially when combined with supervision. You can often learn a simple linear probe that separates proteins with a property from those without, which tells you the property is represented in the space. That’s already ontology-like: a property becomes a region you can identify and navigate.
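A minimal probe sketch, with random placeholder arrays where real embeddings and property labels would go; note that the fitted coefficient vector doubles as a candidate “direction” for the property:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data: real usage would load (n, d) protein embeddings X
# and a binary property y (e.g. 1 = membrane, 0 = soluble).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 128))
y = rng.integers(0, 2, size=500)

probe = LogisticRegression(max_iter=1000)
auc = cross_val_score(probe, X, y, cv=5, scoring="roc_auc").mean()
print(auc)  # an AUC well above 0.5 would mean the property is linearly readable

direction = probe.fit(X, y).coef_[0]  # a candidate "axis" for the property
```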
Why this could change protein discovery workflows
Ontologies aren’t just philosophical. They’re tools for everyday work: annotating a new genome, finding candidate enzymes, prioritizing proteins for structural determination, designing mutants, or searching for homologs across species. Embedding spaces can slot into these workflows in ways that feel more like “search” than “analysis.”
Fast, global retrieval
Instead of running many pairwise comparisons, you precompute embeddings and build an index. A query protein becomes a vector lookup: nearest neighbors in milliseconds. That changes the iteration loop: explore, refine, explore again. It becomes plausible to browse a proteome like you browse a photo library.
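A minimal sketch of that index with FAISS (one of several libraries that would work), assuming the embeddings are already computed and stored as a float32 matrix:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 128
xb = np.random.rand(100_000, d).astype("float32")  # stand-in for precomputed embeddings
faiss.normalize_L2(xb)                # unit vectors: inner product = cosine similarity
index = faiss.IndexFlatIP(d)
index.add(xb)

xq = np.random.rand(1, d).astype("float32")        # the query protein's embedding
faiss.normalize_L2(xq)
scores, ids = index.search(xq, 10)    # top-10 neighbors, typically milliseconds
```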
Annotation transfer with calibrated uncertainty
If a query protein sits inside a neighborhood where 90% of proteins share a domain annotation, you have a strong prior. If it sits at a boundary between neighborhoods, you can surface ambiguity instead of forcing a single label. A continuous space is naturally compatible with “soft” classification—exactly what biology often demands.
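A toy version of that soft transfer, assuming you already have the query's nearest neighbors and their distances (the names are mine):

```python
from collections import Counter

def soft_annotation(neighbor_labels: list[str], distances: list[float]) -> dict[str, float]:
    """Distribution over labels among a query's nearest neighbors,
    weighted by inverse distance; a mixed output is a signal, not a bug."""
    weights = [1.0 / (d + 1e-8) for d in distances]
    total = sum(weights)
    scores = Counter()
    for label, w in zip(neighbor_labels, weights):
        scores[label] += w / total
    return dict(scores)

# {"kinase": 0.91, "pseudokinase": 0.09}  -> strong prior, transfer the label
# {"kinase": 0.55, "phosphatase": 0.45}   -> surface the ambiguity instead
```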
Finding the “near but not obvious”
Many discoveries come from near misses: proteins that are similar enough to hint at function, but different enough to do something new. In an embedding space, those are often the points that sit near a cluster but not inside it, or the points that bridge two regions. Those are good candidates for novelty: new substrate specificity, new regulatory modes, new domain combinations.
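One crude but workable way to surface such candidates: rank proteins by the margin between their two nearest cluster centroids. A sketch, with the caveat that the cluster count must be tuned to your corpus:

```python
import numpy as np
from sklearn.cluster import KMeans

def boundary_candidates(embeddings: np.ndarray, n_clusters: int = 50, top: int = 100) -> np.ndarray:
    """Rank proteins by how evenly they sit between their two nearest
    cluster centroids; small margins = near a cluster but not inside it."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings)
    dists = km.transform(embeddings)           # (n, n_clusters) centroid distances
    nearest_two = np.sort(dists, axis=1)[:, :2]
    margin = nearest_two[:, 1] - nearest_two[:, 0]
    return np.argsort(margin)[:top]            # smallest margins first
```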
A scaffold for multimodal biology
CLIP’s magic was alignment: images and text share a space. Proteins invite even richer alignment: sequences, structures, ligands, reactions, phenotypes, and literature. A protein embedding space can act as a backbone that other modalities attach to. When that works, the “ontology” becomes not just protein families, but functional biology in context.
What has to be true for the ontology to be trustworthy
It’s easy to get carried away. A pretty UMAP plot can make any representation look meaningful. If protein embeddings are going to serve as ontology tools, they need properties that are boringly practical: stability, calibration, interpretability at the tool level, and known failure modes.
Distances must correlate with something you care about
“Close” is only useful if it is aligned with a downstream notion of similarity: shared fold, shared function, shared domain architecture, shared evolutionary origin—pick your target. And because these targets differ, there may not be one universal distance. The same embedding may support multiple ontologies depending on how you probe it.
The space must be robust across trivial confounders
If “nearby” mostly means “similar length,” your ontology is broken. If low-complexity regions dominate geometry, you’ll cluster junk. If organism-specific biases overwhelm functional signals, you’ll build a taxonomy of sampling artifacts. A usable embedding space needs either inherent robustness or straightforward normalization strategies.
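The length confounder, at least, is cheap to check: sample random pairs and test whether embedding distance merely tracks length difference. A diagnostic sketch:

```python
import numpy as np
from scipy.stats import spearmanr

def length_confounding(embeddings: np.ndarray, lengths: np.ndarray, n_pairs: int = 20_000) -> float:
    """Spearman correlation between embedding distance and length
    difference over random pairs; values near 1 mean 'nearby'
    mostly encodes 'similar length'."""
    rng = np.random.default_rng(0)
    i = rng.integers(0, len(embeddings), size=n_pairs)
    j = rng.integers(0, len(embeddings), size=n_pairs)
    emb_dist = np.linalg.norm(embeddings[i] - embeddings[j], axis=1)
    len_diff = np.abs(lengths[i] - lengths[j])
    rho, _ = spearmanr(emb_dist, len_diff)
    return rho
```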
Versioning must be taken seriously
Curated ontologies change slowly and explicitly. Embedding spaces can change abruptly when the model changes. If you build tools on top of embeddings, you need the equivalent of ontology versioning: model identifiers, reproducible embeddings, and explicit statements about what changed. Otherwise your “map” drifts and yesterday’s coordinates stop meaning the same thing.
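In practice this can be as simple as writing a manifest alongside every embedding matrix. A minimal sketch (the field names are mine):

```python
import hashlib
import json
import numpy as np

def embedding_manifest(model_id: str, embeddings: np.ndarray) -> str:
    """Record enough metadata to know when coordinates stop being comparable."""
    return json.dumps({
        "model_id": model_id,   # e.g. model name plus exact revision
        "dim": int(embeddings.shape[1]),
        "n_proteins": int(embeddings.shape[0]),
        "checksum": hashlib.sha256(np.ascontiguousarray(embeddings).tobytes()).hexdigest(),
    }, indent=2)
```

If the checksum changes, yesterday's nearest neighbors are no longer your nearest neighbors.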
“Ontology” should include uncertainty, not just labels
Biology is full of multifunctionality and context dependence. A protein can be “in” multiple categories, or behave differently depending on cell type and condition. A continuous space is an opportunity to represent that nuance: neighborhoods with mixed labels are not bugs; they’re signals. But the tooling has to surface them.
Where embeddings will mislead you
To treat embeddings as ontology tools without getting burned, it helps to name the failure modes plainly.
Correlation masquerading as meaning
A model trained on natural sequences learns natural biases. If certain functions are common in bacteria and rare in eukaryotes, “function” and “taxonomy” may entangle. If secreted proteins have signal peptides, “secreted” might dominate a region of space even when the rest of the sequence differs. You can end up with an ontology of easy cues.
Convergent function, divergent sequence
Convergence is a core biological phenomenon: different sequences can solve the same problem. Sometimes embeddings will bring them together; sometimes they won’t. If a function can be implemented by multiple unrelated folds, a sequence-only embedding may separate them. That’s not wrong; it’s a reminder that “ontology” depends on what information the space contains.
Unknown unknowns and out-of-distribution proteins
Metagenomes, synthetic proteins, extreme repeats, engineered tags—some sequences are simply outside the training distribution. The embedding will still produce a vector, and it will still return neighbors. The neighbors may be nonsense. Any embedding-based ontology needs guardrails: confidence metrics, anomaly detection, and human-in-the-loop checks.
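One simple guardrail is to calibrate a distance threshold on the reference set itself and flag queries that exceed it. A sketch, not a substitute for proper anomaly detection:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def ood_flag(reference: np.ndarray, queries: np.ndarray, quantile: float = 0.99) -> np.ndarray:
    """Flag queries whose nearest-neighbor distance exceeds what the
    reference set itself exhibits; crude, but better than silence."""
    nn = NearestNeighbors(n_neighbors=2).fit(reference)
    ref_dists, _ = nn.kneighbors(reference)     # column 0 is self (distance 0)
    threshold = np.quantile(ref_dists[:, 1], quantile)
    q_dists, _ = nn.kneighbors(queries, n_neighbors=1)
    return q_dists[:, 0] > threshold            # True = treat neighbors with suspicion
```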
False comfort from smoothness
Continuous spaces feel reassuring: everything is “somewhere,” everything has a neighbor. But biology contains discontinuities: a single residue can flip specificity, a domain swap can create a new function, a truncation can destroy activity. Smooth geometry can hide sharp functional cliffs. You need to treat the map as a guide, not a guarantee.
Turning the idea into a real tool: a practical recipe
If you wanted to build an “embedding ontology” system for proteins, you could do worse than this blueprint:
- Pick the representation deliberately. Decide whether you want whole-protein embeddings, per-residue embeddings, or both. Decide which layer(s) you will use and how you pool across residues.
- Define what “similarity” should mean for your use case. Is your ontology about fold, enzymatic function, localization, interaction networks, or something else? There is no default similarity that is universally correct.
- Calibrate with anchors. Use a trusted set of labeled proteins to test whether neighborhoods behave sensibly. Measure: Do known families cluster? Do domain boundaries appear? Do confounders dominate?
- Build a graph, not just a search box. Nearest neighbors are great, but ontologies are relational. Construct a k-NN graph in embedding space and analyze its communities, bridges, and hubs (see the sketch after this list).
- Attach uncertainty to every suggestion. Report neighborhood purity, distance distributions, and “is this protein an outlier?” signals. Make it easy to tell when the map is confident and when it is guessing.
- Close the loop with new labels and experiments. Let users correct the map. Feed validated corrections back into supervision or into curated anchors.
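As a sketch of the graph step referenced above, with library choices that are illustrative rather than prescribed:

```python
import numpy as np
import networkx as nx
from sklearn.neighbors import kneighbors_graph

def embedding_communities(embeddings: np.ndarray, k: int = 15):
    """Build a k-NN graph over embedding space and extract communities:
    communities suggest families; inter-community edges suggest bridges."""
    adj = kneighbors_graph(embeddings, n_neighbors=k, mode="distance")
    graph = nx.from_scipy_sparse_array(adj)
    communities = nx.community.greedy_modularity_communities(graph)
    return graph, communities
```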
The point is not that embeddings magically solve ontology. The point is that they provide a substrate where ontology work becomes more interactive and more data-driven: you can explore the landscape, not just edit the dictionary.
The CLIP-shaped future: alignment between proteins and everything else
CLIP’s most transformative feature was alignment across modalities: a shared space where text and images can talk to each other. For proteins, the analogous dream is a shared space that aligns sequence with structure, function descriptions, reactions, ligands, and even phenotypes.
If you can embed a protein sequence and also embed a textual description like “serine protease involved in coagulation,” then you get something ontology-like almost for free: you can query proteins with language, retrieve candidates, and see where descriptions land in protein space. If you can embed ligands and reactions, you can query by chemistry. If you can embed phenotypes, you can connect genotype to organism-level outcomes.
That’s not a solved problem. It’s also not science fiction. The pattern—contrastive alignment across modalities—is known. The hard part is the biology: the labels are noisier, context matters more, and “ground truth” is often incomplete. Still, the shape of the opportunity is clear: an embedding space can become a hub where disparate biological data attach and become jointly searchable.
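For concreteness, here is what that contrastive objective looks like in miniature: a standard symmetric InfoNCE loss, not the recipe of any specific published protein model.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(protein_emb: torch.Tensor, text_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: matched (protein, description) pairs are pulled
    together; everything else in the batch is pushed apart."""
    p = F.normalize(protein_emb, dim=-1)   # (B, d)
    t = F.normalize(text_emb, dim=-1)      # (B, d)
    logits = p @ t.T / tau                 # (B, B) similarity matrix
    targets = torch.arange(p.size(0), device=p.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```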
So, why call it an ontology at all?
Because ontologies are ultimately about making knowledge usable. They let you ask questions at scale: “What else is like this?” “What category does this belong to?” “What properties tend to co-occur?” “Where are the gaps?” “What should I study next?”
Protein language model embeddings can help answer those questions in a way that fits the modern data regime: millions of proteins, sparse labels, constant discovery, and biology that refuses to stay in crisp boxes. They offer a map that can be browsed, searched, and refined. They won’t replace curated ontologies—those are the names and definitions we use to communicate. But they can supply the geometry underneath: the continuous structure that curated terms gesture toward.
If word2vec made it natural to search among words by meaning, and CLIP made it natural to search among images by concept, then protein embeddings hint at a future where we can search among proteins by biology—not just by alignment, not just by labels, but by position in a learned landscape shaped by evolution.
The promise is not perfection. The promise is leverage: a new coordinate system that lets us navigate the protein universe with the same ease we now navigate text and images.
Footnotes
- When people say “protein language model,” they typically mean transformer-style models trained on massive protein sequence corpora with self-supervised objectives (masked-token prediction, next-token prediction, or variants). Different training choices change what the embedding space emphasizes.
- “Ontology tool” here is intentionally pragmatic: a representation that supports organizing and retrieving proteins in ways that are stable enough to build workflows on, even if the representation itself is not a formal symbolic ontology.