Getting started with OpenNLP (Natural Language Processing)

I found a great set of tools for natural language processing. The Java package includes a sentence detector, a tokenizer, a parts-of-speech (POS) tagger, and a treebank parser. It took me a little while to figure out where to start so I thought I'd post my findings here. I'm no linguist and I don't have previous experience with NLP, but hopefully this will help someone get set up with OpenNLP.

What do these tools do?

Given the input sentences below, we'll take a look at what these tools actually do.

This isn't the greatest example sentence in the world because I've seen better. Neither is this one. This one's not bad, though.

Sentence Detector

Straightforward: it detects sentences. This is more complicated than it sounds, since sentences don't only end with periods and dialogue can also complicate things. Fortunately for us, all of this is handled by the library and we just grab the resulting sentences. Just like the examples in the README, you'll probably start most processing with this because the other tools deal with one sentence at a time.

The sentence detector returns an array of strings. In our example the first element would be:

This isn't the greatest example sentence in the world because I've seen better.
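
To get a feel for the shape of that output without loading any models, here's a minimal sketch that uses the JDK's built-in java.text.BreakIterator as a stand-in. It is rule-based rather than a trained model like OpenNLP's detector, so don't expect identical results on harder text. The class name SentenceSplitSketch is made up for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Stand-in for a sentence detector: NOT OpenNLP, just the JDK's BreakIterator.
class SentenceSplitSketch {

    // Splits a paragraph into trimmed sentences and returns them as an array.
    static String[] split(String paragraph) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(paragraph);
        List<String> sentences = new ArrayList<String>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = paragraph.substring(start, end).trim();
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String paragraph = "This isn't the greatest example sentence in the world "
                + "because I've seen better. Neither is this one. "
                + "This one's not bad, though.";
        for (String s : split(paragraph)) {
            System.out.println(s);
        }
    }
}
```

Like the real detector, this hands back one string per sentence, which is the unit the other tools expect.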

Tokenizer

The POS tagger and treebank parser both need to have sentences broken down into tokens separated by spaces. Tokens are usually words but I noticed that some words get split into multiple tokens. For example, "don't" gets split into "do" and "n't," after its uncontracted form, "do not." Some punctuation also gets split into separate tokens. Here's what it does to our sentence:

This is n't the greatest example sentence in the world because I 've seen better .

Note that the "n't" has become a separate token. The same happened to the contracted "have" and the period has also become a token.

POS Tagger

The tagger uses a dictionary of tags and a trained model to apply parts of speech tags (verb, adverb, personal pronoun) to each token in a sentence. The tagging output conforms to the "Penn Treebank Style." Here's the result of tagging the tokenized sentence:

This/DT is/VBZ n't/RB the/DT greatest/JJS example/NN sentence/NN in/IN the/DT world/NN because/IN I/PRP 've/VBP seen/VBN better/RB ./.

As you can see, each token has been appended with a slash followed by a POS tag. I found this parts of speech reference useful for understanding the tags.
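
If you want to work with that output as data rather than as a string, a small helper can split each item back into its token and tag. This is only a sketch: TaggedTokenSketch and its parse method are hypothetical names, not part of OpenNLP, and it assumes the whitespace-separated token/TAG format shown above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for reading "token/TAG" output; not part of OpenNLP.
class TaggedTokenSketch {

    // Returns {token, tag} pairs. Each item is split on its LAST slash,
    // so "./." yields the token "." with the tag ".".
    static List<String[]> parse(String tagged) {
        List<String[]> pairs = new ArrayList<String[]>();
        for (String item : tagged.trim().split("\\s+")) {
            int slash = item.lastIndexOf('/');
            if (slash < 0) {
                continue; // no tag on this item (also covers empty input)
            }
            pairs.add(new String[] { item.substring(0, slash), item.substring(slash + 1) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        String tagged = "This/DT is/VBZ n't/RB the/DT greatest/JJS example/NN "
                + "sentence/NN in/IN the/DT world/NN because/IN I/PRP 've/VBP "
                + "seen/VBN better/RB ./.";
        for (String[] pair : parse(tagged)) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

Splitting on the last slash rather than the first matters for tokens that contain a slash themselves.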

While the tagger can give you quite a bit of information about the sentence, it doesn't tell you very much about the sentence structure.

Treebank Chunker

The chunker goes a little further in showing sentence structure by breaking the sentence into simple chunks. Noun phrases and verb phrases are recognized and tagged appropriately. Taking our example sentence, we get something like this:

[NP This/DT ] [VP is/VBZ ] n't/RB [NP the/DT greatest/JJS example/NN sentence/NN ] [PP in/IN ] [NP the/DT world/NN ] [SBAR because/IN ] [NP I/PRP ] [VP 've/VBP seen/VBN ] [ADVP better/RB ] ./.

This is pretty useful output. Although it doesn't provide as much information as the parser, it does load up a lot quicker and doesn't require as much memory.
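
The bracketed chunk output is easy to pull apart if you only need the phrases. The sketch below assumes exactly the format shown above (a chunk opens with "[LABEL", closes with a lone "]", and everything else is token/TAG). ChunkOutputSketch is a hypothetical name and the "O" label for tokens outside any chunk is my own convention.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for the bracketed chunk output; not part of OpenNLP.
class ChunkOutputSketch {

    // Returns entries such as "NP: the world"; tokens outside a chunk get "O: token".
    static List<String> parse(String chunked) {
        List<String> result = new ArrayList<String>();
        String label = null;                      // chunk we are inside, or null
        StringBuilder words = new StringBuilder();
        for (String item : chunked.trim().split("\\s+")) {
            if (item.isEmpty()) {
                continue;
            }
            if (item.matches("\\[[A-Z]+")) {      // chunk opener such as "[NP"
                label = item.substring(1);
                words.setLength(0);
            } else if (item.equals("]")) {        // chunk closer
                if (label != null) {
                    result.add(label + ": " + words.toString().trim());
                }
                label = null;
            } else {                              // a token/TAG item
                int slash = item.lastIndexOf('/');
                String word = slash < 0 ? item : item.substring(0, slash);
                if (label == null) {
                    result.add("O: " + word);
                } else {
                    words.append(word).append(' ');
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String chunked = "[NP This/DT ] [VP is/VBZ ] n't/RB "
                + "[NP the/DT greatest/JJS example/NN sentence/NN ] [PP in/IN ] "
                + "[NP the/DT world/NN ] [SBAR because/IN ] [NP I/PRP ] "
                + "[VP 've/VBP seen/VBN ] [ADVP better/RB ] ./.";
        for (String chunk : parse(chunked)) {
            System.out.println(chunk);
        }
    }
}
```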

Treebank Parser

This is the big kahuna. You can tell by the resources it consumes. The parser tags tokens and groups phrases into a hierarchy, building sentence trees of Parse objects. Each of the possible trees for the sentence is also given a probability, which indicates the likelihood that this is the correct way to interpret the sentence. The parser uses the models in the parser models directory, takes about thirty seconds to start up on my machine, and ends up using around 300MB of memory. Once loaded, however, the actual text parsing happens pretty quickly.

Here's the tree generated for our example sentence:

(TOP (S (NP (DT This)) (VP (VBZ is) (RB n't) (NP (NP (DT the) (JJS greatest) (NN example) (NN sentence) ) (PP (IN in) (NP (DT the) (NN world) ) ) ) (SBAR (IN because) (S (NP (PRP I)) (VP (VBP 've) (VP (VBN seen) (ADVP (RB better)) ) ) ) ) ) (. .) ) )
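
That one-line tree is hard to read by eye. As a rough sketch, the bracketed string can be read back into a simple tree with a few lines of recursive code. BracketTreeSketch and its Node class are hypothetical and have nothing to do with OpenNLP's own Parse class. The code assumes well-formed input with balanced parentheses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for bracketed trees; unrelated to OpenNLP's Parse class.
class BracketTreeSketch {

    static final class Node {
        final String label;                       // a tag such as NP, or the word for a leaf
        final List<Node> children = new ArrayList<Node>();
        Node(String label) { this.label = label; }
    }

    private final String[] tokens;
    private int pos = 0;

    private BracketTreeSketch(String text) {
        // pad parentheses with spaces so they become tokens of their own
        tokens = text.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+");
    }

    // Reads one tree from a well-formed bracketed string.
    static Node read(String text) {
        return new BracketTreeSketch(text).node();
    }

    private Node node() {
        if (!tokens[pos].equals("(")) {
            return new Node(tokens[pos++]);       // bare word: a leaf
        }
        pos++;                                    // consume "("
        Node n = new Node(tokens[pos++]);         // label right after "("
        while (!tokens[pos].equals(")")) {
            n.children.add(node());
        }
        pos++;                                    // consume ")"
        return n;
    }

    // Collects the words at the leaves, left to right.
    static void leaves(Node n, List<String> out) {
        if (n.children.isEmpty()) {
            out.add(n.label);
            return;
        }
        for (Node c : n.children) {
            leaves(c, out);
        }
    }

    // Prints one node per line, indented by depth.
    static void print(Node n, String indent) {
        System.out.println(indent + n.label);
        for (Node c : n.children) {
            print(c, indent + "  ");
        }
    }

    public static void main(String[] args) {
        Node root = read("(TOP (S (NP (PRP I)) (VP (VBP 've) "
                + "(VP (VBN seen) (ADVP (RB better)))) (. .)))");
        print(root, "");
    }
}
```

Printing the tree with indentation makes the nesting of phrases much easier to follow than the flat string.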

Some code to get you going

String paragraph  = "...";
 
// the sentence detector and tokenizer constructors
// take paths to their respective models
SentenceDetectorME sdetector =
    new SentenceDetector("models/sentdetect/EnglishSD.bin.gz");
Tokenizer tokenizer = new Tokenizer("models/tokenize/EnglishTok.bin.gz");
 
// the parser takes the path to the parser models
// directory and a few other options
boolean useTagDict = true;
boolean useCaseInsensitiveTagDict = false;
int beamSize = ParserME.defaultBeamSize;
double advancePercentage = ParserME.defaultAdvancePercentage;
ParserME parser = TreebankParser.getParser(
        "models/parser", useTagDict, useCaseInsensitiveTagDict,
        beamSize, advancePercentage);
 
// break a paragraph into sentences
String[] sents = sdetector.sentDetect(paragraph);

Now we feed each of the sentences to the tokenizer, and pass the output to the parser.

String sent = sents[0];
 
// tokenize brackets and parentheses by putting a space on either side.
// this makes sure it doesn't get confused with output from the parser
sent = untokenizedParenPattern1.matcher(sent).replaceAll("$1 $2");
sent = untokenizedParenPattern2.matcher(sent).replaceAll("$1 $2");
 
// get the tokenizer to break apart the sentence
String[] tokens = tokenizer.tokenize(sent);
 
// build a string to parse as well as a list of tokens
StringBuffer sb = new StringBuffer();
List<String> tokenList = new ArrayList<String>();
for (int j = 0; j < tokens.length; j++)
{
    String tok = convertToken(tokens[j]);
    tokenList.add(tok);
    sb.append(tok).append(" ");
}
String text = sb.toString().trim();

The parser takes a two-layer tree of Parse objects. The parent Parse object holds a list of child Parse objects—one for each token in the sentence.

// the parent parse instance spans the entire sentence
Parse p = new Parse(text, new Span(0, text.length()), "INC", 1, null);
 
// create a parse object for each token and add it to the parent
int start = 0;
for (String tok : tokenList)
{
    p.insert(new Parse(text, new Span(start, start + tok.length()),
                       ParserME.TOK_NODE, 0));
    start += tok.length() + 1;
}
 
// fetch multiple possible parse trees; numParses is the number of
// candidate trees you want back (it is not defined in this snippet)
Parse[] parses = parser.parse(p, numParses);

At this point, parses holds the different possible parse trees for the sentence. Now, figure out what you want to do with it, you must!
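
The only fiddly part of the code above is the span bookkeeping: every token needs the character offsets where it sits in the space-joined text. Here is that arithmetic on its own, as a sketch with no OpenNLP classes involved. TokenSpanSketch is a hypothetical name and the spans are plain int pairs rather than Span objects.

```java
// Hypothetical illustration of the offset arithmetic; uses no OpenNLP classes.
class TokenSpanSketch {

    // Returns {start, end} pairs (end exclusive) for tokens joined by single spaces.
    static int[][] spans(String[] tokens) {
        int[][] spans = new int[tokens.length][];
        int start = 0;
        for (int i = 0; i < tokens.length; i++) {
            spans[i] = new int[] { start, start + tokens[i].length() };
            start += tokens[i].length() + 1;      // +1 skips the separating space
        }
        return spans;
    }

    public static void main(String[] args) {
        String[] tokens = { "I", "'ve", "seen", "better", "." };
        String text = String.join(" ", tokens);
        int[][] spans = spans(tokens);
        for (int i = 0; i < tokens.length; i++) {
            // the substring at each span should be the token itself
            System.out.println(spans[i][0] + "-" + spans[i][1] + "\t"
                    + text.substring(spans[i][0], spans[i][1]));
        }
    }
}
```

If the offsets drift (for example because tokens were joined with something other than a single space), the substrings stop matching the tokens, which makes this a handy sanity check.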

Comments

Anonymous says, "java code formatting"

Hi, nice post thanks. However the final java sample code is barely readable. Could you please add some <pre/> tag around it?

daniel says, "Whoops"

Looks like this page lost some styling when I moved from Wordpress to Drupal. Looks much better now!

Anonymous says, "what is the definition of"

What is the definition of untokenizedParenPattern1 and 2?

daniel says, "untokenizedParenPattern definitions"

Hi Anon,
The definitions are as follows:

private static Pattern untokenizedParenPattern1 =
    Pattern.compile("([^ ])([({)}])");
private static Pattern untokenizedParenPattern2 =
    Pattern.compile("([({)}])([^ ])");

If I remember correctly, these patterns are used to distinguish parentheses that were part of the input string and those that denote tokens. Hope that helps.
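
For anyone curious what the two patterns do in practice, here is a small self-contained sketch that applies them the same way the post does. The patterns are copied from the definitions above. The class name ParenSpacingSketch and the sample sentence are made up for the example.

```java
import java.util.regex.Pattern;

// Demonstrates the two patterns above on a sample sentence.
class ParenSpacingSketch {

    // a non-space character directly followed by a bracket
    private static final Pattern BEFORE = Pattern.compile("([^ ])([({)}])");
    // a bracket directly followed by a non-space character
    private static final Pattern AFTER = Pattern.compile("([({)}])([^ ])");

    // Puts a space on either side of parentheses and braces.
    static String space(String sent) {
        sent = BEFORE.matcher(sent).replaceAll("$1 $2");
        return AFTER.matcher(sent).replaceAll("$1 $2");
    }

    public static void main(String[] args) {
        // prints: I like parsers ( mostly ) .
        System.out.println(space("I like parsers (mostly)."));
    }
}
```

Once the brackets stand alone, the tokenizer sees each one as its own token.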

Gunith says, "What does convertToken do?"

The function convertToken is not shown in the code given. Can you please include that function here? Thanks.

Btw, this page is REALLY useful as there is NO documentation on how to work with OpenNLP except the JavaDocs, which aren't helping much to a n00b like me.

Hoping for a quick response.
Gunith

daniel says, "convertToken function listing"

Hi Gunith,

Here's the code for the convertToken method. This is also part of the code that protects the parentheses and braces from being interpreted as output from the parser.

private static String convertToken(String token)
{
	if (token.equals("(")) return "-LRB-";
	else if (token.equals(")")) return "-RRB-";
	else if (token.equals("{")) return "-LCB-";
	else if (token.equals("}")) return "-RCB-";
	return token;
}

ashwin ittoo says, "normalizing"

thanks for the tutorial
is there a way to obtain the canonical versions of tokens: e.g. USA and U.S.A --> usa

thanks

daniel123 says, "Using the chunker"

Would be great if you could include the piece of code that creates & calls the Chunker as well...
thanks so much,
/d

Teng-Kai Fan says, "Please Help me"

This document is very useful to me.
May I ask about a problem I ran into?
When I compiled the code you shared above, I got the following errors:

bad class file: D:\Tools-N\Java-API\opennlp-tools-1.3.0\lib\opennlp-tools-1.3.0.jar(opennlp/tools/sentdetect/SentenceDetectorME.class)
class file has wrong version 50.0, should be 49.0
Please remove or make sure it appears in the correct subdirectory of the classpath.
SentenceDetectorME sdetector = new SentenceDetector("D:\\Tools-N\\Machine Learning+NLP\\OpenNLP\\sentdect\\EnglishSD.bin.gz");

1. Could you tell me which version of opennlp-tools you use (1.3.0 or 1.0.0)?
2. I got the model files for sentence detection and tokenizing (i.e., EnglishSD.bin.gz and EnglishTok.bin.gz), but I am not sure where the correct location is. I put them at an absolute path, like "SentenceDetectorME sdetector = new SentenceDetector("D:\\Tools-N\\Machine Learning+NLP\\OpenNLP\\sentdect\\EnglishSD.bin.gz")";
Is that correct?

Sincerely yours

Sophia Lee says, "Wondering if you can release your source code?"

Great job! However, I ran into some problems placing the .gz files. Could you release your source code and explain how to run it? thx!

daniel says, "Gz Files"

Hi Sophia,

Unfortunately I haven't touched NLP for ages (this blog post is a year old) and I don't have time to put together a sample application. However, the .gz files in my project are in a structure like below. Hope that helps you!

  • models
    • chunker
      • EnglishChunk.bin.gz
    • namefind
      • date.bin.gz
      • location.bin.gz
      • money.bin.gz
      • organization.bin.gz
      • percentage.bin.gz
      • person.bin.gz
      • time.bin.gz
    • parser
      • build.bin.gz
      • check.bin.gz
      • chunk.bin.gz
      • dict.bin.gz
      • head_rules
      • tag.bin.gz
      • tagdict
    • sentdetect
      • EnglishSD.bin.gz
    • tokenize
      • EnglishTok.bin.gz

daniel says, "OpenNLP 1.3.0"

Hi Teng-Kai Fan,

I've posted my version of the OpenNLP library. Not sure if it'll work for your setup so you may have to compile the library yourself. Either way, good luck.

In answer to your second question, I've posted the folder structure in the comment above. Best of luck with the project!

doron says, "Not in GZIP format"

great tutorial thx!!!!

I am having some problems running the code. I'm new to Java and having a lot of problems. After running it I get this message on the console:

java.io.IOException: Not in GZIP format
at java.util.zip.GZIPInputStream.readHeader(Unknown Source)
at java.util.zip.GZIPInputStream.<init>(Unknown Source)
at java.util.zip.GZIPInputStream.<init>(Unknown Source)
at opennlp.maxent.io.SuffixSensitiveGISModelReader.<init>(SuffixSensitiveGISModelReader.java:50)
at opennlp.tools.lang.english.Tokenizer.<init>(Tokenizer.java:36)
at test.Main.main(Main.java:24)

do you know what could be the problem...

Soana says, "Thx for your"

Thx for your website.

Without this it would have been impossible to start with OpenNLP - you know I'm completely new to Java.

Soana says, "How to run the parser"

Hi Daniel, hi guys

As I posted yesterday I'm completely new to Java, so I have some basic problems running OpenNLP tools.
I'm working on a Windows machine.
I'm following all of Daniel's tips step by step, and I also followed Daniel's hint to build my directory tree.
Everything works well but I cannot run the Treebank Parser - I'm using Eclipse: when I run the tool I give one String argument to the TreebankParser, and I write down the path: models\parser.
In this case the console prints out:

""Exception in thread "main" java.io.FileNotFoundException: models\parser\tag.bin.gz (Impossibile trovare il file specificato)""

That means the specified file could not be found.
At this point I copied the file tag.bin.gz into the models\parser dir, ran the tool again, and got:

""Exception in thread "main" java.io.FileNotFoundException: models\parser\head_rules (Impossibile trovare il file specificato)""

And finally, if I put the head_rules dir inside the models\parser dir and run the tool again, it tells me that access to the folder is denied.

What should I do?

Thx

Soana says, "Problem solved"

OK I can post and answer by myself

I solved the problem ;)

The only thing I did differently from Daniel was putting the file
tag.bin.gz in the models/parser directory as well, and creating a file called head_rules, without any extension, in the same directory: it works!

quek says, "Hey, Soana, you're supposed"

Hey,

Soana, you're supposed to download the models to be used in conjunction with the OpenNLP jar file. Just creating the files and not using the models won't be very helpful. Go to http://opennlp.sourceforge.net/models.html to check it out.

Soana says, "corpus for training data"

Thx quek, anyway I had already downloaded those models: it was just a problem of setting up directories.

Well, now I have a big challenge: I need to understand how to create models for the Italian language, so I need to know which file format the training data should have in order to create a new model.
I have looked at a lot of websites but I really cannot find any tips!

So the first simple questions are:
1) Do I need an Italian corpus for training models?
2) If I need a corpus, which style of annotation should I use in the corpus?

Thx a lot if anyone knows ;)

Katrin says, "POS Tagger Models"

Thanks for the post, very useful.
I wonder how you managed to POS-tag the English sentence. From what I saw on the OpenNLP models download site, there is only a POS tagging model for Spanish. Could you give me a hint?

Paler says, "Nice post"

Hi. I'd like to thank you for the starting push on OpenNLP. My first choice was to use Freeling, but after a little searching, knowing only the name of the OpenNLP project beforehand, I decided to give it a try when I found this article... I don't know yet if I'll stick with OpenNLP or go back to Freeling (it seems the homepage is not working for now...), but this article helped me give it a first try. It seems OpenNLP is very easy to use; now I have to study whether it is as extensive and complete as Freeling (I don't think so, but just a few days won't hurt my project).

Paler says, "PosTagging example"

Sorry to use your blog for a little discussion, but I think this could help other people, including Katrin, who's asking how to parse English sentences...
This is a little code that does something like that.

public class Main
{
  private static final String TOKENS = "path to EnglishTok.bin.gz";
  private static final String DICT = "path to dict.bin.gz";
  private static final String TAGDICT = "path to tag.bin.gz";
 
  private static final String TEXT = "This is a testing sentence";
 
  public static void main(String[] args)
  {
    try
    {
 
      TokenizerME tokenizer = new TokenizerME((new SuffixSensitiveGISModelReader(new File(TOKENS))).getModel());
      tokenizer.setAlphaNumericOptimization(true);
      String[] tokens = tokenizer.tokenize(TEXT);
 
      POSTaggerME postagger = new POSTaggerME(getModel(TAGDICT), new DefaultPOSContextGenerator(new Dictionary(DICT)));
      String[] tags = postagger.tag(tokens);
 
      if(tags!=null)
        for(int i=0;i<tags.length;i++)
          System.out.println("tag "+i+" = "+tags[i]);
 
    } catch (IOException ex)
    {
      Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
    }
  }
 
  private static MaxentModel getModel(String name)
  {
    try
    {
      return new SuffixSensitiveGISModelReader(new File(name)).getModel();
    } catch (IOException e)
    {
      e.printStackTrace();
      return null;
    }
  }
}

You just have to add the imports, replace the three constants at the beginning with real paths and set the classpath (you need maxent and trove, apart from OpenNLP, of course).

Off the record, just a couple of suggestions: the README file included in the OpenNLP bundle is very helpful, AND the classes mentioned there (PosTagger, SentenceDetector...) are examples of how to use the real util classes, like the ones used in the code above (POSTaggerME...), so you can just copy-paste code and make it work!

Hope this helps!

Thanks for the space ;)

Edited: fixed the code tag as you suggested, Paler. I also set the language to Java so it's a bit easier to read. For future reference, you can use <code language="java">...</code>. Thanks for posting your code for others!

About

Daniel McLaren

Daniel is a Flash and Flex developer specializing in the art of information visualization.
