Class StandardsText

java.lang.Object
org.apache.tika.sax.StandardsText

public class StandardsText extends Object
StandardsText relies on regular expressions to extract standard references from text.

This class finds standard references in text by performing the following steps:

  1. searches for headers;
  2. searches for patterns that are likely to be standard references (essentially, any string composed mostly of uppercase letters followed by alphanumeric characters);
  3. assigns each potential standard reference an initial score of 0.25;
  4. increases by 0.25 the score of references that include the name of a known standards organization (StandardOrganizations);
  5. increases by 0.25 the score of references that include the word "Publication" or "Standard";
  6. increases by 0.25 the score of references that are found within "Applicable Documents" and equivalent sections;
  7. returns the standard references along with scores.
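
The scoring in steps 3 to 6 can be sketched as follows. This is only an illustration of the arithmetic described above, not the actual Tika implementation: the class name StandardScoreSketch, the method score, and the short organization list are all hypothetical, and the real class obtains known organizations from StandardOrganizations and detects candidates and sections with regular expressions.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the scoring rules; not part of Apache Tika.
final class StandardScoreSketch {

    // Assumed sample of organization acronyms; the real list lives in StandardOrganizations.
    private static final Set<String> KNOWN_ORGANIZATIONS = Set.of("ISO", "IEEE", "ANSI", "NIST");

    // The two keywords named in step 5, compared case-insensitively here (an assumption).
    private static final Set<String> KEYWORDS = Set.of("PUBLICATION", "STANDARD");

    /**
     * Scores one candidate reference.
     *
     * @param reference             the candidate text, e.g. "IEEE Standard 802.11"
     * @param inApplicableDocuments whether the candidate was found in an
     *                              "Applicable Documents" or equivalent section
     */
    static double score(String reference, boolean inApplicableDocuments) {
        boolean hasOrganization = false;
        boolean hasKeyword = false;

        // Split on anything that is not an uppercase letter or digit, then match whole tokens.
        for (String token : reference.toUpperCase(Locale.ROOT).split("[^A-Z0-9]+")) {
            hasOrganization |= KNOWN_ORGANIZATIONS.contains(token);
            hasKeyword |= KEYWORDS.contains(token);
        }

        double score = 0.25;                        // step 3: initial score
        if (hasOrganization) score += 0.25;         // step 4: known organization
        if (hasKeyword) score += 0.25;              // step 5: "Publication" or "Standard"
        if (inApplicableDocuments) score += 0.25;   // step 6: found in a relevant section
        return score;
    }
}
```

Under these rules a score is always one of 0.25, 0.5, 0.75, or 1.0. In this sketch, "IEEE Standard 802.11" found in an "Applicable Documents" section satisfies all three bonus conditions and reaches 1.0, while a candidate with no recognized organization, keyword, or section stays at 0.25.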

  • Constructor Details

    • StandardsText

      public StandardsText()
  • Method Details

    • extractStandardReferences

      public static ArrayList<StandardReference> extractStandardReferences(String text, double threshold)
      Extracts the standard references found within the given text.
      Parameters:
      text - the text from which the standard references are extracted.
      threshold - the minimum score a standard reference must have to be returned. For instance, a threshold of 0.75 means that only references with a score greater than or equal to 0.75 are returned.
      Returns:
      the list of standard references extracted from the given text.
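
The effect of the threshold parameter can be illustrated with a small self-contained sketch. The names ThresholdSketch, Candidate, and select are hypothetical stand-ins used only to show the inclusive comparison described above; they are not Tika classes, and the real method returns StandardReference objects rather than strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of threshold-based selection; not part of Apache Tika.
final class ThresholdSketch {

    // Minimal stand-in for a scored reference (assumed shape, for illustration only).
    static final class Candidate {
        final String text;
        final double score;

        Candidate(String text, double score) {
            this.text = text;
            this.score = score;
        }
    }

    // Keeps the candidates whose score is greater than or equal to the threshold.
    static List<String> select(List<Candidate> candidates, double threshold) {
        List<String> kept = new ArrayList<>();
        for (Candidate candidate : candidates) {
            if (candidate.score >= threshold) {
                kept.add(candidate.text);
            }
        }
        return kept;
    }
}
```

Because the comparison is inclusive, a candidate scored exactly 0.75 is kept when the threshold is 0.75. With the documented signature, the corresponding call on the real class would be StandardsText.extractStandardReferences(text, 0.75).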