
The Premise

In order to decipher this structure as objectively as possible, I took an approach that is unusual for the literary world but will hopefully give a solid foundation for unraveling the architecture of The Sonnets. Instead of a purely literary analysis, I performed a statistical analysis, relying on techniques from statistics and computer science to quantify the relationships between the poems and their shared repetitions. The specifics of these techniques, as well as the literary conclusions drawn from them, are explained in greater detail in these blog posts.


It is important to note here that the following work is not a flawless piece of research. I am not an expert in any of the topics discussed here, literary or mathematical, just a student of both. There are many possible sources of error, whether in failing to find every repetition used in The Sonnets or in the choice and implementation of the methods. What I can say is that this is an imperfect start to a study that can easily be expanded to include more data and to analyze it better. At the very least it provides a starting point and, hopefully, enough supporting evidence to back up the claims it makes.


The Math Behind the Poetry

The goal of this project is to objectively describe the structure created through repetition in The Sonnets. More specifically, this means explicitly showing which sonnets are more connected through shared phrases and words, whether there exist small clumps or groupings of sonnets, and cataloging the various repetitions that create this structure. One could do this by hand, counting and sorting repetitions and then comparing all the sonnets. However, given the sheer number of different repeated words and phrases and how many times they occur throughout the book, this task would take far too long for one person to reasonably accomplish. There may even be patterns in the repetition that are too subtle for the human eye and mind to catch in a limited amount of time. Therefore we turn to mathematics for assistance.

The first step in this mathematical analysis of The Sonnets is figuring out exactly what we want to end up with. What we would like is a description of the relationships between sonnets: a way to discern which sonnets are more closely related and interlinked through the repetitions. This sounds like a way to understand the sonnets visually, plotted on a 2-d graph, where distance is our measure of similarity. Sonnets that are closer together on the plot will be more closely related than those that are further away. This provides a visual way to view the structure as well as a quantifiable measure of similarity between sonnets. The question is, where do we get the numbers to plot the sonnets in the first place? What two numbers can you assign to a poem to measure similarity? That is where the main mathematical tool we will use comes into play: Dimensionality Reduction.


 

Dimensionality Who?


Don’t worry: I realize that dimensionality reduction sounds at first like a horribly complex process, something out of most people's nightmares, but I hope to show that it is not only a fascinating technique but a lot more intuitive than first meets the eye. The goal of dimensionality reduction is exactly what it sounds like: to reduce the number of dimensions of whatever you are studying. Talk of higher dimensions may already have some people feeling lost, but the idea of higher dimensions in mathematics is incredibly useful and simple to work with. It simply requires adding an extra number. For example, we think of ourselves as living in 3-dimensional space, something that can be described by 3 coordinates. In math, to start talking about 4-dimensional space, we just need to add an extra number, an extra coordinate. So when we start talking about 76 dimensions later on, don't worry about trying to wrap your head around what that visually looks like; just realize that all it means is that each point is a list of 76 numbers. We are going to use some dimensionality reduction techniques to turn a 76-dimensional representation of The Sonnets into a 2-dimensional one. Of course, we lose some information when we make such a dramatic decrease in dimensions, but the hope is that we will be able to easily describe and visually see the relationships with this lower-dimensional representation.


So the first question is: how exactly are we going to represent each sonnet as a long list of numbers? And beyond that, a list of meaningful numbers? Well, let's look at a much simpler example and show how we would turn a small collection into lists of numbers representing their shared words and lines. Suppose that we have the following small collection of poems:



apple cat

cat dog

apple dog



Riveting, I know. Between these three poems we have three shared words: apple, cat, and dog. Imagine that for each poem we associate a list of three numbers, one for each word, where each number is a 1 if the poem contains the word and a 0 if it doesn't. Thus we have the following three lists of numbers (the technical word would be a vector) corresponding to the three poems above:


[1 1 0] [0 1 1] [1 0 1]



The idea is that poems that are more similar will share more 1's and 0's in the same positions, and thus their “distance”, which we will define below (for those curious, it is simply the familiar Euclidean distance formula you learned in middle school or high school), will be smaller. Now this is a relatively simple example, and when doing this for The Sonnets there is a much larger selection of repeated elements that has to be accounted for. In total, for the 79 sonnets, I cataloged 76 shared words and phrases, and even that is not a complete cataloging of all the repetitions in The Sonnets.
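To make this concrete, here is a minimal Python sketch of how a poem could be turned into one of these binary vectors, and how the Euclidean distance between two of them is computed. The poem and variable names are only illustrative, not anything taken from The Sonnets:

import math

# The three toy poems and their shared vocabulary (illustrative names).
poems = {
    "poem 1": "apple cat",
    "poem 2": "cat dog",
    "poem 3": "apple dog",
}
vocabulary = ["apple", "cat", "dog"]

def to_vector(text):
    """Return a binary vector: 1 if the poem contains the word, 0 if not."""
    words = set(text.split())
    return [1 if word in words else 0 for word in vocabulary]

def euclidean_distance(a, b):
    """The familiar distance formula: square the differences, add them up, take the root."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

vectors = {name: to_vector(text) for name, text in poems.items()}
print(vectors)  # {'poem 1': [1, 1, 0], 'poem 2': [0, 1, 1], 'poem 3': [1, 0, 1]}
print(euclidean_distance(vectors["poem 1"], vectors["poem 2"]))  # sqrt(2), about 1.41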


Okay, so we have a list of numbers for each sonnet, but these lists are far too long to actually work with. So how are we going to reduce them into smaller lists that still capture the idea that similar sonnets have smaller distances between them? Let's go back to our simple example to get a feel for how dimensionality reduction might work. In our simple example, each poem is represented by a point in 3-dimensional space (since each one is just a list of 3 numbers). Imagine that we want to reduce this to only 2 dimensions; how might we go about it? Suppose instead that we had a sphere sitting on top of a 2-dimensional plane. How could we create a representation of this sphere on the plane it is sitting on?


Well, imagine we had a large, powerful flashlight, larger than the sphere, and we shine it onto the sphere. What happens? We see a shadow of the sphere projected down onto the plane.



Professional drawing of the sphere and flashlight example


In a way, we have created a 2-dimensional representation of our 3-dimensional object. We can think about trying to find the angle of the flashlight that best preserves the integrity of the shape, or whatever properties of it we are trying to capture. This analogy translates to our higher-dimensional case. While we can't imagine what 76 dimensions look like, we can imagine our key problem as trying to find the best 2-dimensional shadow of this data, the one that allows us to visualize and compare these mathematical representations of the sonnets. There is a problem, though: how do we find this optimal shadow? And how do we know it is optimal? To answer the second question, there might not be one best shadow. Different shadows might display different aspects of our data. So we will use two different dimensionality reduction techniques to find two shadows that will help us explore two important aspects of The Sonnets.



 

Method 1: Principal Component Analysis


The first method we will use is a standard dimensionality reduction technique from statistics, and it will help us capture the outliers in our data: Principal Component Analysis (PCA). The idea behind PCA is this: for each of our sonnets we have a list of 76 characteristics describing it, namely the repeated elements it contains. What we want is a smaller number of characteristics, in this case 2, that best summarize our data. We are not merely picking 2 of the original characteristics (the repeated elements); instead we are trying to find new characteristics, combinations of the original ones, that best summarize the data. By "best summarize", in this case, we mean the characteristics that differ the most across the sonnets. We want characteristics that allow us to distinguish sonnets from each other, and therefore characteristics that vary greatly from one sonnet to the next. In technical language, we want to find the two characteristics that have the greatest variance. To gain more intuition and see how the process explicitly works, let's look back at our example from earlier.


Recall the three lists of numbers, the vectors, from above that we used to represent our three poems:


apple cat: [1 1 0]

cat dog: [0 1 1]

apple dog: [1 0 1]


We can look at how this visually looks by plotting each point in 3-dimensional space:


Plotting the Poems

Now we somehow want to find the best angle to project this 3-dimensional representation down to a 2-dimensional one that keeps the integrity of the shape of the data. To do this, we first need to organize our data in a specific way: a matrix. For those who don't know, a matrix is simply a grid of numbers with a certain number of rows and columns. It is frequently used in data analysis to perform operations on large amounts of data. In our example, we only need a 3 x 3 matrix; that is the matrix on the far left in the image below:



Our matrix of data used to perform PCA

This matrix is simply the 3 poem vectors stacked on top of each other. Now, in order to do PCA, we need to make one simple adjustment that does not change the meaning of the data but is necessary for the calculation. We subtract the average of each column (in this case 2/3) from each number in that column. This gives our data an average of 0, which makes the PCA algorithm much easier to work with.
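For anyone who wants to see this centering step concretely, here is a small NumPy sketch of it (my own code, not the matrices pictured above):

import numpy as np

# Rows are the three poems, columns are the shared words (apple, cat, dog).
X = np.array([
    [1, 1, 0],   # apple cat
    [0, 1, 1],   # cat dog
    [1, 0, 1],   # apple dog
], dtype=float)

# Subtract each column's average (2/3 here) so every column now averages to 0.
X_centered = X - X.mean(axis=0)
print(X_centered)
# [[ 0.333  0.333 -0.667]
#  [-0.667  0.333  0.333]
#  [ 0.333 -0.667  0.333]]   (values rounded)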


The next two steps are going to be a bit hand-wavy, but I will try my best to give an intuition of what is going on. First we find the covariance matrix associated with this data matrix. Intuitively, the covariance matrix measures the amount of variability between the different columns of a matrix. In our case, this means quantitatively measuring how the usage of two repeated words (each represented by a column of the matrix) varies together across the poems. A positive covariance means the two words tend to appear in the same poems, while a negative covariance means they tend to appear in different poems. Once we have obtained this covariance matrix, our next step is to find its eigenvectors. This may sound intimidating, but it is actually quite an intuitive idea, and the intuition is best illustrated through a picture. Take a look at the following examples of a dataset on a plane, along with their corresponding covariance matrices:




Now take a look at their illustrated eigenvectors:








Can you see what's going on here? The green and purple lines drawn through each scatter plot illustrate the eigenvectors of the covariance matrix. They are simply the directions along which the data varies the most! And this is exactly what we need in order to make our lower-dimensional version of the poem data. So, applying this to our example above, we now calculate the eigenvectors of the covariance matrix:



What this gives us is two lines in 3-dimensional space that span the plane onto which we project our 3-dimensional data! We can even see what these eigenvectors are telling us. Each entry assigns a value of "importance" to one of the repeated elements. For example, the bottom-right entry in the matrix above tells us that, for the second eigenvector, the entry for "dog" carries significance (as it was already accounted for in the first eigenvector). To get our final set of 2-dimensional coordinates for our poems, we simply multiply the transpose of the eigenvector matrix by our (centered) data matrix:





When we plot these we get this scatter plot:



If we compare this 2-d plot to the 3 dimensional one above, we can see how well the integrity of the shape is kept, and how the relationships between the poems (the distances) are respected in this 2 dimensional projection.
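Putting the worked example together, here is a NumPy sketch of the whole calculation described above. It is my own reconstruction, not the code behind the figures; the projection is written here as (centered data) times (eigenvectors), which is just the transposed form of the multiplication described in the text:

import numpy as np

# Rows are poems, columns are the shared words (apple, cat, dog).
X = np.array([
    [1, 1, 0],   # apple cat
    [0, 1, 1],   # cat dog
    [1, 0, 1],   # apple dog
], dtype=float)

# Step 1: center each column so it averages to 0.
X_centered = X - X.mean(axis=0)

# Step 2: the covariance matrix of the columns.
cov = np.cov(X_centered, rowvar=False)

# Step 3: eigenvectors of the covariance matrix, sorted from largest to smallest eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
top_two = eigenvectors[:, order[:2]]

# Step 4: project the centered data onto the top two eigenvectors.
coords_2d = X_centered @ top_two
print(coords_2d)   # one (x, y) pair per poem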


To get a 2-dimensional representation of The Sonnets, this exact process was repeated, only now with 79 poems and 76 different repeated elements. The matrices are much larger and the reduction in dimension is from 76 down to 2. Because the data is so high-dimensional, there are two things to point out: it is impossible to visualize the original data set, and due to the large reduction in size, some information is lost in the process. Still, the process gives us a scatter plot representation of The Sonnets:



PCA Scatter Plot of The Sonnets

To dive into what this plot tells us about The Sonnets, visit the other post, where we explore the literary side of things. There is also a link on the resources page to code for an interactive version where you can see where each sonnet lies on the plot. If you want to learn more about the specifics of PCA and how it works, there are links on the resources page to more in-depth mathematical explanations, as well as to guides on implementing it in code.
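For readers who would rather not code the eigenvector steps by hand, a library such as scikit-learn can do the reduction in a few lines. The sketch below uses a random stand-in array in place of my actual 79 x 76 repetition data (that data is what the resources page links to), so the plot it produces will not match the one above:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Random stand-in data: in the real analysis each row is one of the 79 sonnets
# and each column one of the 76 cataloged repeated elements (1 = present, 0 = absent).
rng = np.random.default_rng(0)
repetition_matrix = rng.integers(0, 2, size=(79, 76)).astype(float)

pca = PCA(n_components=2)                       # keep the two directions of greatest variance
coords = pca.fit_transform(repetition_matrix)   # PCA centers the columns internally

plt.scatter(coords[:, 0], coords[:, 1])
for i, (x, y) in enumerate(coords, start=1):
    plt.annotate(str(i), (x, y))                # label each point with its sonnet number
plt.title("PCA projection of the repetition vectors")
plt.show()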


Before moving on to the next section, it is important to note a major limitation of PCA: it doesn't always preserve small distances between points that are close together in the higher-dimensional representation. What this means for us is that some distances may be stretched in the 2-dimensional representation, making some sonnets appear less closely related than they actually are. Thus, for our analysis, we will primarily use the PCA representation as a way to measure difference between sonnets, looking at which sonnets are further away than most in order to find outliers. This seems to present a large problem: if we can't be sure that PCA has maintained distances between close points, how can we properly tell when two sonnets are very closely related? Do not fear, because that is where our next dimensionality reduction method comes into play.



 

Method 2: t-Distributed Stochastic Neighbor Embedding


Already from the name, our second method might sound quite intimidating. To be honest, it is a good deal more complex than PCA, and we won't be able to go into as much detail as we did there. But with this added complexity comes the bonus that distances between points are maintained better than they were in PCA. Thus we can use the representation from t-Distributed Stochastic Neighbor Embedding (t-SNE) to understand the similarities between the sonnets, and the relationships they form with one another based on their shared repetitions.


The general idea behind t-SNE isn't too hard to understand, but the details require a good bit of background in probability theory and linear algebra that we don't have the space to build up here. Links will be placed on the resources page for those who want to dig into the details. The basic outline of t-SNE looks like this: first, we take our original high-dimensional data set (in this case our 76-dimensional vectors representing the repetitions in the sonnets) and calculate the distances between the points in high-dimensional space. To calculate the distance between two vectors, you subtract corresponding entries (i.e. subtract the first number of one vector from the first number of the other, then the second numbers, and so on), square the differences, and add them up. The square root of this sum is the distance between the two vectors, which many of you will recognize as the familiar distance formula you learned in middle or high school. We then arrange these distances in a matrix, much as we did above. After creating this distance matrix, we randomly place the points on a 2-d plane and do the same thing, creating a distance matrix for the 2-d points (with a slight variation we will not discuss here). The goal of the t-SNE algorithm is then to make the 2-d distance matrix look like the original high-dimensional distance matrix, which it does by making small changes to the positions of the points one step at a time.
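Here is a comparable sketch using scikit-learn's t-SNE implementation. As with the PCA sketch above, the repetition matrix is random stand-in data, and the perplexity value is just a plausible guess rather than the setting behind the plot below:

import numpy as np
from sklearn.manifold import TSNE

# Random stand-in for the 79 x 76 repetition matrix (rows = sonnets, columns = repeated elements).
rng = np.random.default_rng(0)
repetition_matrix = rng.integers(0, 2, size=(79, 76)).astype(float)

# perplexity roughly controls how many neighbors each point "pays attention to".
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
tsne_coords = tsne.fit_transform(repetition_matrix)
print(tsne_coords.shape)   # (79, 2): one (x, y) pair per sonnet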


While this method is harder to visualize and the mathematics behind it is much more complex, it isn't hard to understand why it maintains distances better than PCA: the entire goal of the algorithm is to have the distances in the lower-dimensional representation match the distances of the high-dimensional data. To see what this method produces, here is the two-dimensional plot we get for The Sonnets using t-SNE:



t-SNE Plot of The Sonnets (Note: colors are just for visual aesthetics)

Like the last plot, we will dive into what this means for The Sonnets in the post on the literary side of the analysis. We can already appreciate, though, how much more even the distribution seems than in the PCA plot, suggesting that there is information to be gained here that we couldn't get from the PCA plot.



 

The Code Behind the Analysis


In this last section I wanted to take a look at the code I wrote to analyze the 2-dimensional representations gained from the PCA and t-SNE analyses. All the code was written in Python, and there will be a link to a file containing it all on the resources page.





The code above was used to analyze the data, and the results will be discussed in greater detail on the literary analysis blog post. Here I just wanted to run through it and give a brief description of what each method does, in case someone wants to try to run the code themselves.


euclidean_distance: This method simply finds the Euclidean distance between two 2-dimensional vectors.


get_neighbors: For each sonnet, this method finds the sonnets closest to it, with the number of neighbors controlled by the NUM_OF_NEIGHBORS parameter at the top of the file.


getNextNeighbors: This method finds "next-door" neighbors, meaning pairs of sonnets that each appear on the other's list of neighbors. These are sonnets that are especially close in the 2-d representations.


getAntiNeighbors: This method does the opposite of get_neighbors: for each sonnet, it finds the sonnets that are furthest away from it.


getOutliers: Using the result of getAntiNeighbors, this method finds which sonnets appear on the anti-neighbor lists of the most other sonnets, and thus which sonnets are furthest from the majority: the outliers.


findCycles: This method finds what I'm calling cycles, a term borrowed from a branch of mathematics called graph theory. The idea is that it finds paths of nearby sonnets, jumping from one to another in a random fashion but always staying close. It does this for thousands or even millions of iterations, ultimately showing which groups of sonnets are the most related based on the data it is given.
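The file linked on the resources page contains the actual code; the sketch below is only my reconstruction of what these helpers might look like, using the names described above (findCycles is omitted because its random-walk details would run too long for a sketch):

import math
from collections import Counter

NUM_OF_NEIGHBORS = 5   # assumed value; the real file sets this parameter at the top

def euclidean_distance(p, q):
    """Distance between two 2-d points from the PCA or t-SNE plot."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def get_neighbors(points):
    """For each sonnet, the indices of its NUM_OF_NEIGHBORS closest sonnets."""
    neighbors = {}
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: euclidean_distance(p, points[j]),
        )
        neighbors[i] = others[:NUM_OF_NEIGHBORS]
    return neighbors

def getNextNeighbors(neighbors):
    """Pairs of sonnets that each appear on the other's neighbor list."""
    return [
        (i, j)
        for i, near in neighbors.items()
        for j in near
        if i < j and i in neighbors[j]
    ]

def getAntiNeighbors(points):
    """For each sonnet, the indices of the sonnets furthest away from it."""
    anti = {}
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: euclidean_distance(p, points[j]),
            reverse=True,
        )
        anti[i] = others[:NUM_OF_NEIGHBORS]
    return anti

def getOutliers(anti_neighbors, top=5):
    """Sonnets that appear on the most anti-neighbor lists, i.e. the outliers."""
    counts = Counter(j for far in anti_neighbors.values() for j in far)
    return counts.most_common(top)

# Example usage on the 2-d coordinates produced by PCA or t-SNE:
# points = [(x1, y1), (x2, y2), ...]   one pair per sonnet
# outliers = getOutliers(getAntiNeighbors(points))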
