A method for calculating pattern relevance in a text

Posted on Sat 06 February 2016 in algorithms, IT, programming

This article aims at presenting a method for computing the relevance of a given string (pattern) in a text. This algorithm is at the core of my WordPress plugin Smart Tag Insert.

First of all, there is a difference between a simple pattern matching and computing text pattern relevance. The question we are trying to address here is the following: I have a string, and I would like to know how much that string is relevant for a specific text. For example, let's say we have "download music" as the string of which relevance we are interested into. How can we determine how much relevant it is for a specific article?

The simple approach

The easy thing one could try is run a pattern match of "download music" in the article text. That is okay, but suppose the article contained strings like "download the music", or "download some music", or "downloading music", or "download good quality music". It is clear that, to a human, all these strings are equivalent when trying to understand what the article is about: it is about downloading music, regardless of whether it is good, bad, a lot or little.

A simple pattern match would fall short, because it would exclude all those other strings and make it look like the content is not very much about downloading music, just because "download music" was never found exactly that way.

So the first point we need to acknowledge if we want to try to teach a machine to compute text pattern relevance, is that we need to find a way, at least a rough way, to teach it to grasp the meaning of the content.

(One thing that should be clear from the beginning is that we are not going to teach a machine to understand what a text is about, but rather whether the subject of the text is the one we are suggesting (for example, from a list). We are not asking what a text is about, we are asking whether the given text is related to "download music", and how we can assign a number to that relevance.)

A more sophisticated approach

So we have a couple of things we need to think about:

  1. how do we run a gentle pattern match, so that "downloading music" gets recognized as well? However, we still don't want the match to be completely ineffective. If a text starts with "download" and ends with "music", and has 3k words in between, that match should value pretty low (or nothing). So the next problem is:
  2. how do we effectively quantify the value of a single match?

The gentle/flexible pattern match

Point one is rather technical. We can build a regex that does the text pattern match the way we want like this:

^\bdownload(.*?)\bmusic^i

People better at dealing with regular expressions than I am may notice that this regex does what we wished, but has a limitation: it will match fine "download some music" and even "downloading some musicality" (whatever it means), but it will miss "redownload music" and "download discomusic". The choice here is due to the fact that prefixes change word meanings more often than post-fixes do (for example, imagine a text about the "new york run": allowing prefixes as well as post-fixes would make "new york forerunner" seem related to the text, while it is not).

Anyway, in a human language our regex reads: match strings that

  • start with download. No exceptions. Don't allow anything before (this assures us a real match is starting);
  • at some point (even 1k chars away) have music. Again, not discomusic or any other prefix to music;
  • end after music;
  • do the above case-insensitively.

Of course, one can have regex for complex strings, however complex they may be. For example, if one wishes to match "download good music fast", the regex would be

^\bdownload(.*?)\bgood(.*?)\bmusic(.*?)\bfast^i

A good idea would be to discard matches with full stops in them, and we will do that while processing the matches.

Determining a relevance score

Once we find a match, we need to assign a score to it. The idea is that if "download music" is found exactly, then it Is a perfect match, if "download good and nice music" is found, then yes it should be a match, but we do not want it to have the same score as a perfect match, and if "downloading an antivirus is important so that your computer can't be hurt by fake music" is found, then it should have a pretty low score.

The idea is to rely on the number of characters that are between each word of the string to match. This is a rough method, but it is fine enough to grant decent relevance scores. Mathematically speaking, we want the relevance of a match to be a (strictly) decreasing function of the number of characters that stand in between of the match words.

The formula we are going to use can be schematized like this:

relevance of a match = relevance of a perfect match - decrease of relevance due to in between chars

This expression still has two unknown values. We thus have two questions to address now:

  1. What should the relevance of a perfect match be? My answer comes from the following (arguable) assumption: perfectly finding 10 times the match in a 360 words text means 100% relevance of the given string. This is actually kind of reasonable: if you consider that an average sentence can be of 30 words, then to find a perfect match in each sentence would need 12 matches in the whole text. We could use 12 instead of 10, but 12 relates to an unlikely scenario, and would probably make posts reach lower scores than their real ones. So 10 is my choice.
  2. How much should relevance decrease, given the number of chars in between? This is 100% empirical, based on my expectations on test texts, but it looked like \(x^\frac{4}{5}\) is fine. It plays well on small numbers (so that "download the music" has still quite high a relevance) but gets bigger as soon as numbers grow a little bit, so that it drops the relevance of matches with words too far apart.

Share your knowledge!

Now, I am sure there are better ways to compute pattern relevance in a text. This article wants to be a starting point for methods of this kind, since I am quite confident that the ideas and the questions I have tried to address here do have some value. Getting to the root of the questions that we need to answer to solve a specific problem is more important than the answers one can provide, and are anyway fundamental in finding the right answers. So if you have any idea to improve this method, or have something completely different in mind, please do share it in the comments below, I would really like to know!

The complete algorithm for text pattern relevance

At this point, we have everything to build the algorithm that will calculate the relevance given a string (needle) and a text (haystack). What follows is written in PHP.

<?php
function get_tag_relevance( $needle, $haystack, &$overall_relevance ) {
   if( strlen( $haystack ) == 0 ) { $overall_relevance = 0; return; } //empty text -> 0 relevance for everything

    //Break the pattern in single words
   $exploded_needle = explode( " ", $needle );
   $explode_count = count( $exploded_needle );

   //Strip special chars from text and count text words
   $purged_haystack = preg_replace( '/[(),;:!?%#$"_+=\\/-]+/', '', trim( preg_replace( '/\'|&nbsp;|&#160;|\r|\n|\r\n|\s+/', ' ', strip_tags( $haystack ) ) ) ); //need to trim to remove final new lines
   $haystack_words = count( preg_split( '/\s+/', $purged_haystack, -1, PREG_SPLIT_NO_EMPTY ) );

   //Calculate how much a perfect occurrence should weigh
   $occurrence_weigth = 10*350/$haystack_words; //we consider perfectly finding 10 times the tag in a 360-words text as 100% relevance

   //Build the regex pattern
   $pattern = "^\b";
   $n = 1;
   foreach( $exploded_needle as $single ) {
      $pattern .= $single;

      if( $n < $explode_count )
      $pattern .= "(.*?)\b";

      ++$n;
   }

   $pattern .= "^i"; //case insensitive

   $output = array();
   $overall_relevance = 0;
   preg_match_all( $pattern, $haystack, $occurrences, PREG_PATTERN_ORDER );

   //preg_match_all( $pattern, $haystack, $occurrences_offsets, PREG_OFFSET_CAPTURE ); //this can be useful in computing relevance, maybe early matches should weigh more than later ones

   if( count( $occurrences[0] ) == 0 ) return 0;

   //If the string to match is just one word, its relevance is its weigth * times it occurres
   if( $explode_count == 1 ) {
      $overall_relevance = count( $occurrences[0] )*$occurrence_weigth;

   } else {
      $n = 0;
      foreach( $occurrences[1] as $single ) {
         //If pattern words are separated by a full stop, this shouldn't count as a match
         if( strpos( $single, '.' ) !== false ) continue;

         //Calculate how much this occurrence contributes to overall relevance
         $single = trim( $single );
         $relevance = $occurrence_weigth - ( pow( strlen( $single ), 4/5 ) );
         $output[] = array(
            "text_in_between" => $single,
            "relevance" => $relevance/*,
            "offset" => $occurrences_offsets[0][$n][1]*/
         );

         $overall_relevance += $relevance;
         ++$n;
      }
   }

   if( $overall_relevance > 100 )
   $overall_relevance = 100;

   return $output;
}