A method for calculating pattern relevance in a text

This article aims at presenting a method for computing the relevance of a given string (pattern) in a text. This algorithm is at the core of my WordPress plugin Smart Tag Insert.

First of all, there is a difference between a simple pattern matching and computing text pattern relevance. The question we are trying to address here is the following: I have a string, and I would like to know how much that string is relevant for a specific text. For example, let’s say we haveĀ “download music” as the string of which relevance we are interested into. How can we determine how much relevant it is for a specific article?

The simple approach

The easy thing one could try is run a pattern match of “download music” in the article text. That is okay, but suppose the article contained strings like “download the music”, or “download some music”, or “downloading music”, or “download good quality music”. It is clear that, to a human, all these strings are equivalent when trying to understand what the article is about: it is about downloading music, regardless of whether it is good, bad, a lot or little.

A simple pattern match would fall short, because it would exclude all thoseĀ other strings and make it look like the content is not very much about downloading music, just because “download music” was never found exactly that way.

So the first point we need to acknowledge if we want to try to teach a machine to compute text pattern relevance, is that we need to find a way, at least a rough way, to teach it to grasp the meaning of the content.

Continue reading