Genome mapping the knowledge domain

A short note on vector embeddings, their caveats and potential creative uses.

Feb 24, 2025

A map of Earth uses coordinates (latitude and longitude).

A knowledge map's coordinates are high-dimensional vectors.

Any knowledge area can be embedded as vectors with thousands of dimensions that capture the narratives in that domain, as long as you have enough data to fill the number of dimensions you choose. Choose that number carefully.

Higher dimensionality, such as 3,000+, can capture the intricacies of your domain very well, but if there are not enough data points, you will suffer from the curse of dimensionality. In other words, the vectors become hard to tell apart: too many of them end up looking alike.
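
One way to see this effect is to measure how much the nearest and farthest neighbours of a point differ as the dimension count grows. The snippet below is a minimal sketch, assuming uniformly random points rather than real embeddings; `relative_contrast` is a hypothetical helper name, and the point count of 200 is an arbitrary choice for illustration.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """Return (max_dist - min_dist) / min_dist from one query point to the rest.

    A value near zero means every other point is almost equally far away,
    so "nearest" and "farthest" stop being meaningful distinctions.
    """
    rng = random.Random(seed)
    # Assumption: uniform random points stand in for real embedding vectors.
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query, others = points[0], points[1:]
    dists = [math.dist(query, p) for p in others]
    return (max(dists) - min(dists)) / min(dists)


low = relative_contrast(n_points=200, n_dims=2)
high = relative_contrast(n_points=200, n_dims=3000)
print(f"contrast in 2 dimensions:    {low:.2f}")
print(f"contrast in 3000 dimensions: {high:.2f}")
```

With the same 200 points, the contrast in 3,000 dimensions should come out far smaller than in 2 dimensions. Real embeddings are not uniformly random, so treat this as an illustration of the tendency, not as a measurement of your own data.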

This is akin to taking multiple photos with a 61MP camera in the dark. The photos will all look the same. Each file could be as large as 129MB, yet it would hold almost no information, regardless of how many megapixels the camera captures.

Caveats:

Temporal aspects of knowledge are often lost in static embeddings. You can try to capture them with extra features, such as a one-hot encoding of the time period
Compute cost increases with the number of dimensions
Once you settle on a number of dimensions, it is non-trivial to change it if the knowledge domain shifts
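
The one-hot idea from the first caveat can be sketched as follows. This is only one possible reading of it: the helper names `one_hot` and `add_time_bucket`, the yearly buckets and the three-value embedding are all assumptions made for illustration.

```python
def one_hot(index, size):
    """Return a list of zeros with a single 1.0 at position `index`."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} is outside 0..{size - 1}")
    return [1.0 if i == index else 0.0 for i in range(size)]


def add_time_bucket(embedding, year, first_year, n_years):
    """Append a one-hot year bucket to a static embedding (hypothetical helper)."""
    return list(embedding) + one_hot(year - first_year, n_years)


# Assumption: a toy 3-dimensional embedding and four yearly buckets (2022-2025).
static_embedding = [0.12, -0.40, 0.33]
dated = add_time_bucket(static_embedding, year=2024, first_year=2022, n_years=4)
print(dated)
```

Every time bucket adds one more dimension, so this approach trades directly against the compute caveat, and the bucket range has to be fixed up front, much like the dimension count itself.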

Creative ideas:

If you do not have enough data to differentiate the vectors, consider combining it with other datasets before increasing the dimensionality
Use the curse of dimensionality as an anomaly detection mechanism. If most vectors look the same, the outliers can be detected easily
Tools you can explore to manage/reduce dimensionality: PCA, t-SNE, UMAP
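
The anomaly detection idea above can be sketched with nothing more than distances from the centroid. This is a minimal sketch under stated assumptions: `find_outliers` is a hypothetical helper, the z-score threshold of 3.0 is an arbitrary choice, and the data is synthetic, with one vector deliberately planted far from the rest.

```python
import math
import random
import statistics


def find_outliers(vectors, threshold=3.0):
    """Return indices of vectors unusually far from the centroid.

    Each vector is scored by its Euclidean distance to the mean vector;
    a vector is flagged when that distance sits more than `threshold`
    standard deviations above the average distance.
    """
    n_dims = len(vectors[0])
    centroid = [statistics.fmean(v[d] for v in vectors) for d in range(n_dims)]
    dists = [math.dist(v, centroid) for v in vectors]
    mean = statistics.fmean(dists)
    stdev = statistics.pstdev(dists)
    if stdev == 0:
        return []  # every vector is equally far away; nothing stands out
    return [i for i, d in enumerate(dists) if (d - mean) / stdev > threshold]


# Assumption: 100 near-identical synthetic vectors, with one planted anomaly.
rng = random.Random(42)
vectors = [[rng.gauss(0.0, 0.01) for _ in range(50)] for _ in range(100)]
vectors[17] = [1.0] * 50
print(find_outliers(vectors))
```

Because the other 99 vectors are nearly indistinguishable, the planted vector at index 17 is the only one flagged. On real data the threshold would need tuning, and a robust statistic such as the median may behave better when there are many anomalies.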

Candidates:

Medical history, insurance claims, customer care conversations, etc.
