Tuesday, January 31, 2023

Understanding the ngram

Ngrams can seem complicated, especially for those without a formal understanding of computer data objects. I wanted to post about ngrams and why they're special, interesting, and, for as simple as they are, so powerful.

How do we store data in a computer? We have variables for that. However, assigning one name to every variable makes it clunky to recall. For that reason, we want to organize the data into chunks. One of the very earliest and simplest data types is the linked list.

The linked list is a 1-dimensional line of variables connected to each other. The list doesn't actually have to "exist" as its own separate object. Each element of the list points to other elements in the list: the first object points to the second, and the second object points forward to the third and back to the first. So, all you need to start a list is a first object.

I have a list:

  1. Hieronymus 
  2. Sofonisba
  3. Grant 
  4. Blastonymus

A linked list doesn't have those numbers. Instead, you have something like this:

Hieronymus-Sofonisba-Grant-Blastonymus

So, if we want to get to Blastonymus starting from Hieronymus, we must first pass Sofonisba and Grant, taking three steps. We can also move back along the list by following the pointers that link the objects to one another. The most important thing to understand here is that you can only move back or forth. That means you're in a one-dimensional space, on a line. Imagine yourself as a little car driving down the street, trying to get to your friend Blastonymus's house while you're currently hanging out with Hieronymus. You must now drive three houses down. Your entire universe is those four houses.
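
Here is a minimal sketch of that street in Python. The names Node, link, and steps_to are made up for this illustration, and I'm assuming each element keeps a pointer both forward and back, as described above.

```python
class Node:
    """One house on the street: a name plus pointers to its neighbours."""

    def __init__(self, name):
        self.name = name
        self.prev = None  # pointer back toward the start of the list
        self.next = None  # pointer forward toward the end of the list


def link(names):
    """Chain the names together and return the first node."""
    head = None
    tail = None
    for name in names:
        node = Node(name)
        if tail is None:
            head = node  # the first object is all we need to hold on to
        else:
            tail.next = node
            node.prev = tail
        tail = node
    return head


def steps_to(start, target):
    """Count forward steps from start to the node named target (None if absent)."""
    steps = 0
    node = start
    while node is not None and node.name != target:
        node = node.next
        steps += 1
    return steps if node is not None else None


head = link(["Hieronymus", "Sofonisba", "Grant", "Blastonymus"])
print(steps_to(head, "Blastonymus"))  # 3
```

Notice that there is no "list" object anywhere, only a first node and the pointers between neighbours.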

Binary trees

The garden of forking paths is your binary tree. Every time you make a choice, you're presented with another one until the space ends. The point of organizing information like this is that it allows you to give directions as a series of left- or right-hand turns. This space still makes sense to us, even when we're presented with three or four choices. We can still move back and forth in a way that doesn't entirely mess with our sense of direction. Once we get to ngrams, that will stop.
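
As a rough sketch, directions through a binary tree really can be written as a string of turns. The TreeNode class, the follow function, and the labels below are invented for this example.

```python
class TreeNode:
    """A fork in the path: a label plus an optional left and right branch."""

    def __init__(self, label, left=None, right=None):
        self.label = label
        self.left = left
        self.right = right


def follow(root, turns):
    """Follow a string of 'L'/'R' turns from root; return the label reached, or None."""
    node = root
    for turn in turns:
        node = node.left if turn == "L" else node.right
        if node is None:
            return None  # the space ended before the directions did
    return node.label


tree = TreeNode(
    "start",
    TreeNode("left", TreeNode("left-left"), TreeNode("left-right")),
    TreeNode("right"),
)
print(follow(tree, "LR"))  # left-right
```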

Ngrams

Linked lists are one-dimensional data objects. You can only go back and forth. Binary trees are two-dimensional data objects because they bifurcate and offer more paths. Ngrams are n-dimensional data objects and they cause problems.

Imagine that you're at Hieronymus's house and you go to Blastonymus's house. So, now you're there, but gosh, you forgot your important object. So, now you must go back. The road you took is not the same road you need to take back. In fact, that road is gone. Worse, there may be no way at all to get from Blastonymus's house to Hieronymus's house.

Further, instead of one direction, you have an indefinite number of dimensions that are continually added as we accrue ngrams. While it's certainly possible you could find your way back, it's also possible that way does not exist. 

With a linked list or a binary tree, you can always work your way back. With ngrams, you can end up on an island in a universe with no exits. 

Demonstration

"The first dog demanded that the second dog be more of a dog."

This is our sentence. An ngram algorithm (here counting pairs of adjacent words, so n = 2) will take this sentence and return the following data:

  • The->first(1), second(1)
  • Dog->demanded(1), be(1), .(1)
  • First->dog(1)
  • Demanded->that(1)
  • ...

In this case, the entry for "The" produces the links "first" and "second". The numbers represent the weight, or how many times the two terms occur in sequence. Once you go from "The" to "first", there's no link leading back to "The". The ground collapses behind you. But that's not the real problem spatially.
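
For the curious, here is one way such a table could be built in Python. This is only a sketch under my own assumptions: the function name bigram_links is made up, words are lowercased so that "The" and "the" count as the same term, and the final period is treated as its own token.

```python
import re
from collections import Counter, defaultdict


def bigram_links(text):
    """Map each token to a Counter of the tokens that directly follow it."""
    # Assumption: lowercase everything and split off punctuation as separate tokens.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    links = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        links[current][following] += 1
    return links


sentence = "The first dog demanded that the second dog be more of a dog."
links = bigram_links(sentence)
for word in ("the", "dog", "first", "demanded"):
    print(word, "->", dict(links[word]))
# the -> {'first': 1, 'second': 1}
# dog -> {'demanded': 1, 'be': 1, '.': 1}
# first -> {'dog': 1}
# demanded -> {'that': 1}
```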

Even if you could get back from "first" to "The", the same roads won't be awaiting you.

So, let's say we drive from Hieronymus's house to Sofonisba's house. Sofonisba's house could be connected to Hieronymus's house, or to hundreds, even thousands, of other houses that suddenly appear once we're at Sofonisba's. So, even if there's a link back to our house, it's just one among thousands of others that lead elsewhere.
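
Whether a way back exists at all is something we can check by searching the links. The sketch below uses a breadth-first search over the demonstration sentence. The function names build_links and route are made up, and the sentence is lowercased and split by hand, which is my assumption rather than a fixed rule.

```python
from collections import deque


def build_links(tokens):
    """Map each token to the set of tokens that directly follow it."""
    links = {}
    for current, following in zip(tokens, tokens[1:]):
        links.setdefault(current, set()).add(following)
    return links


def route(links, start, goal):
    """Breadth-first search: shortest chain of words from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(links.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # an island with no road to the goal


# Assumption: lowercase words, with the final period as its own token.
tokens = "the first dog demanded that the second dog be more of a dog .".split()
links = build_links(tokens)
print(route(links, "first", "the"))  # ['first', 'dog', 'demanded', 'that', 'the']
print(route(links, ".", "the"))      # None
```

In this tiny sentence a roundabout route from "first" back to "the" happens to exist, but it takes four steps rather than one, and the final period is a dead end with no exits at all.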

The ngram behaves like a multi-dimensional spatial object akin to a hypercube.
