Getting started with OpenNLP (Natural Language Processing)
I found a great set of tools for natural language processing. The Java package includes a sentence detector, a tokenizer, a part-of-speech (POS) tagger, and a treebank parser. It took me a little while to figure out where to start, so I thought I'd post my findings here. I'm no linguist and I don't have previous experience with NLP, but hopefully this will help someone get set up with OpenNLP.
What do these tools do?
Given the input sentences below, we'll take a look at what these tools actually do.
Sentence Detector
Straightforward: it detects sentences. This is more complicated than it sounds, since sentences don't only end with periods, and dialogue can also complicate things. Fortunately for us, all of this is handled by their code and we just grab the sentences it returns. Just like the examples in the README, you'll probably start most processing with this because the other tools deal with one sentence at a time.
The sentence detector returns an array of strings. In our example the first element would be:
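To get a feel for the array-of-strings result without loading any model, here is a minimal sketch that uses the JDK's rule-based java.text.BreakIterator as a stand-in. It is not OpenNLP's trained detector and will make different decisions on hard cases; the class name, method name, and sample paragraph are made up for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplitSketch {
    // Splits a paragraph into sentences with the JDK's rule-based
    // BreakIterator (a stand-in, NOT OpenNLP's maxent sentence detector).
    static String[] splitSentences(String paragraph) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(paragraph);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = paragraph.substring(start, end).trim();
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Hypothetical input, not the example paragraph from this post.
        String paragraph = "This is a sentence. Here is another one! Is this the third?";
        for (String s : splitSentences(paragraph)) {
            System.out.println(s);
        }
    }
}
```

Like sentDetect, the helper hands back one string per sentence, so the rest of the pipeline can loop over the array.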
Tokenizer
The POS tagger and treebank parser both need to have sentences broken down into tokens separated by spaces. Tokens are usually words, but I noticed that some words get split into multiple tokens. For example, "don't" gets split into "do" and "n't," after its uncontracted form, "do not." Some punctuation also gets split into separate tokens. Here's what it does to our sentence:
Note that the "n't" has become a separate token. The same happened to the contracted "have" and the period has also become a token.
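The splitting behaviour described above can be imitated with a few regular expressions. This is only a rough sketch of Penn Treebank-style tokenization, assuming simple English input; OpenNLP's tokenizer uses a trained model instead, and the class name, method name, and sample sentence here are hypothetical.

```java
import java.util.Arrays;

class TreebankTokenizeSketch {
    // Rough regex imitation of Treebank-style token splitting.
    // It deliberately ignores hard cases (abbreviations, numbers like 1,000).
    static String[] tokenize(String sentence) {
        String s = sentence;
        // "don't" -> "do n't"
        s = s.replaceAll("(?i)(\\w)(n't)\\b", "$1 $2");
        // "we've" -> "we 've", "it's" -> "it 's", and similar clitics
        s = s.replaceAll("(?i)(\\w)('(?:ve|re|ll|s|d|m))\\b", "$1 $2");
        // commas, semicolons and colons become their own tokens
        s = s.replaceAll("\\s*([,;:])\\s*", " $1 ");
        // sentence-final punctuation becomes its own token
        s = s.replaceAll("([.!?])\\s*$", " $1");
        return s.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // Hypothetical sentence with a contracted "not" and "have".
        System.out.println(Arrays.toString(tokenize("I don't think we've met.")));
    }
}
```

The point is the shape of the result: contractions and the final period end up as separate array elements, which is what the tagger and parser expect.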
POS Tagger
The tagger uses a dictionary of tags and a trained model to apply part-of-speech tags (verb, adverb, personal pronoun) to each token in a sentence. The tagging output conforms to the "Penn Treebank Style." Here's the result of tagging the tokenized sentence:
As you can see, each token has had a slash and a POS tag appended to it. I found this parts of speech reference useful for understanding the tags.
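If you want to work with the tags programmatically, the token/TAG strings are easy to pull apart. Below is a small sketch that splits on the last slash of each item; the input string is hand-written in the slash format described above, not real tagger output, and the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;

class TaggedTokenSketch {
    // Turns "token/TAG token/TAG ..." into a list of {token, tag} pairs.
    // The LAST slash is used so tokens that contain a slash still work.
    static List<String[]> parse(String tagged) {
        List<String[]> pairs = new ArrayList<>();
        for (String item : tagged.trim().split("\\s+")) {
            int slash = item.lastIndexOf('/');
            if (slash <= 0) {
                continue; // not in token/TAG form, skip it
            }
            pairs.add(new String[] { item.substring(0, slash), item.substring(slash + 1) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Hand-written sample in the slash format (an assumption, not tagger output).
        String tagged = "I/PRP do/VBP n't/RB know/VB ./.";
        for (String[] pair : parse(tagged)) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```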
While the tagger can give you quite a bit of information about the sentence, it doesn't tell you very much about the sentence structure.
Treebank Chunker
The chunker goes a little further in showing sentence structure by breaking the sentence into simple chunks. Noun phrases and verb phrases are recognized and tagged appropriately. Taking our example sentence, we get something like this:
This is pretty useful output. Although it doesn't provide as much information as the parser, it loads a lot quicker and doesn't require as much memory.
Treebank Parser
This is the big kahuna. You can tell by the resources it consumes. The parser tags tokens and groups phrases into a hierarchy, building sentence trees of Parse objects. Each of the possible trees for the sentence is also given a probability that indicates how likely it is to be the correct interpretation of the sentence. The parser uses the models in the parser models directory, takes about thirty seconds to start up on my machine, and ends up using around 300MB of memory. Once loaded, however, the actual text parsing happens pretty quickly.
Here's the tree generated for our example sentence:
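Penn Treebank-style trees are usually written as nested brackets, which get hard to read on one line. The sketch below indents such a string one constituent per line. The sample tree is hand-written for illustration (not real parser output), the class and method names are hypothetical, and the code assumes literal parentheses never appear as tokens.

```java
class TreeIndentSketch {
    // Pretty-prints a bracketed tree: every "(LABEL" starts a new line,
    // indented two spaces per nesting level; words stay on their label's line.
    static String indent(String bracketed) {
        String[] toks = bracketed.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+");
        StringBuilder out = new StringBuilder();
        int depth = 0;
        boolean expectLabel = false;
        for (String t : toks) {
            if (t.equals("(")) {
                if (out.length() > 0) {
                    out.append('\n');
                }
                for (int i = 0; i < depth; i++) {
                    out.append("  ");
                }
                out.append('(');
                depth++;
                expectLabel = true;
            } else if (t.equals(")")) {
                out.append(')');
                depth--;
            } else if (expectLabel) {
                out.append(t);        // constituent or POS label
                expectLabel = false;
            } else {
                out.append(' ').append(t); // the word itself
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hand-written sample tree, an assumption for illustration only.
        System.out.println(indent("(S (NP (PRP I)) (VP (VBP run)) (. .))"));
    }
}
```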
Some code to get you going
String paragraph = "...";

// the sentence detector and tokenizer constructors
// take paths to their respective models
SentenceDetectorME sdetector = new SentenceDetector("models/sentdetect/EnglishSD.bin.gz");
Tokenizer tokenizer = new Tokenizer("models/tokenize/EnglishTok.bin.gz");

// the parser takes the path to the parser models
// directory and a few other options
boolean useTagDict = true;
boolean useCaseInsensitiveTagDict = false;
int beamSize = ParserME.defaultBeamSize;
double advancePercentage = ParserME.defaultAdvancePercentage;
ParserME parser = TreebankParser.getParser(
    "models/parser", useTagDict, useCaseInsensitiveTagDict, beamSize, advancePercentage);

// break a paragraph into sentences
String[] sents = sdetector.sentDetect(paragraph);
Now we feed each of the sentences to the tokenizer, and pass the output to the parser.
String sent = sents[0];

// tokenize brackets and parentheses by putting a space on either side.
// this makes sure it doesn't get confused with output from the parser
sent = untokenizedParenPattern1.matcher(sent).replaceAll("$1 $2");
sent = untokenizedParenPattern2.matcher(sent).replaceAll("$1 $2");

// get the tokenizer to break apart the sentence
String[] tokens = tokenizer.tokenize(sent);

// build a string to parse as well as a list of tokens
StringBuffer sb = new StringBuffer();
List<String> tokenList = new ArrayList<String>();
for (int j = 0; j < tokens.length; j++) {
    String tok = convertToken(tokens[j]);
    tokenList.add(tok);
    sb.append(tok).append(" ");
}
// drop the trailing space (also safe when there are no tokens)
String text = sb.toString().trim();
The parser takes a two-layer tree of Parse objects. The parent Parse object holds a list of child Parse objects—one for each token in the sentence.
// the parent parse instance spans the entire sentence
Parse p = new Parse(text, new Span(0, text.length()), "INC", 1, null);

// create a parse object for each token and add it to the parent
int start = 0;
for (String tok : tokenList) {
    p.insert(new Parse(text, new Span(start, start + tok.length()), ParserME.TOK_NODE, 0));
    // move past this token and the single space that follows it
    start += tok.length() + 1;
}

// fetch multiple possible parse trees; numParses (not shown) is the
// number of candidate trees to ask for
Parse[] parses = parser.parse(p, numParses);
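The offset bookkeeping in that loop is easy to get wrong, so here is a standalone sketch of just the arithmetic, with no OpenNLP classes involved. It assumes the tokens were joined with exactly one space, as in the code above; the class name, method name, and sample tokens are made up for illustration.

```java
class TokenSpanSketch {
    // Returns {start, end} character offsets for each token inside the
    // string made by joining the tokens with single spaces.
    static int[][] spans(String[] tokens) {
        int[][] result = new int[tokens.length][];
        int start = 0;
        for (int i = 0; i < tokens.length; i++) {
            int end = start + tokens[i].length();
            result[i] = new int[] { start, end };
            start = end + 1; // skip the single space separator
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical tokens, already in tokenizer-style form.
        String[] tokens = { "I", "do", "n't", "know", "." };
        String text = String.join(" ", tokens);
        for (int[] s : spans(tokens)) {
            // substring(start, end) should give back exactly the token
            System.out.println(s[0] + "-" + s[1] + " " + text.substring(s[0], s[1]));
        }
    }
}
```

If a substring ever fails to match its token, the joined text and the token list have drifted apart, and the spans handed to the parser would be wrong too.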
At this point, parses holds the different possible parse trees for the sentence. Now, figure out what you want to do with it, you must!

Comments
Hi, nice post thanks. However the final java sample code is barely readable. Could you please add some <pre/> tag around it?
Looks like this page lost some styling when I moved from Wordpress to Drupal. Looks much better now!
What is the definition of untokenizedParenPattern1 and 2?
Hi Anon,
The definitions are as follows:
If I remember correctly, these patterns are used to distinguish parentheses that were part of the input string from those the parser uses in its output. Hope that helps.
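A sketch of what the two patterns probably look like, assuming they match the ones in the TreebankParser class that ships with OpenNLP (treat the exact regular expressions as an assumption and check them against your copy of the source). The first puts a space before a bracket that touches the preceding character, the second puts one after a bracket that touches the following character. The wrapper class and the spaceOutParens helper are made up for this example.

```java
import java.util.regex.Pattern;

class ParenSpacingSketch {
    // Assumed definitions: a non-space followed by a bracket, and the reverse.
    static final Pattern untokenizedParenPattern1 = Pattern.compile("([^ ])([({)}])");
    static final Pattern untokenizedParenPattern2 = Pattern.compile("([({)}])([^ ])");

    // Applies both patterns the same way the post's code does.
    static String spaceOutParens(String sent) {
        sent = untokenizedParenPattern1.matcher(sent).replaceAll("$1 $2");
        sent = untokenizedParenPattern2.matcher(sent).replaceAll("$1 $2");
        return sent;
    }

    public static void main(String[] args) {
        // Hypothetical input with brackets glued to the neighbouring words.
        System.out.println(spaceOutParens("a (b) c"));
    }
}
```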
The function convertToken is not shown in the code given. Can you please include that function here? Thanks.
Btw, this page is REALLY useful as there is NO documentation on how to work with OpenNLP except the JavaDocs, which aren't helping much to a n00b like me.
Hoping for a quick response.
Gunith
Hi Gunith,
Here's the code for the convertToken method. This is also part of the code that protects the parentheses and braces from being interpreted as output from the parser.
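A minimal sketch of such a method, assuming it follows the usual Penn Treebank convention of replacing literal brackets with the -LRB-, -RRB-, -LCB- and -RCB- placeholders (the wrapper class name is made up, and the version in your OpenNLP source may differ in detail):

```java
class ConvertTokenSketch {
    // Replaces literal brackets with Penn Treebank placeholder tokens so they
    // cannot be mistaken for the brackets in the parser's tree output.
    static String convertToken(String token) {
        switch (token) {
            case "(": return "-LRB-";
            case ")": return "-RRB-";
            case "{": return "-LCB-";
            case "}": return "-RCB-";
            default:  return token; // everything else passes through unchanged
        }
    }

    public static void main(String[] args) {
        for (String t : new String[] { "(", "word", ")" }) {
            System.out.println(t + " -> " + convertToken(t));
        }
    }
}
```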
thanks for the tutorial
is there a way to obtain the canonical versions of tokens: e.g. USA and U.S.A --> usa
thanks
Would be great if you could include the piece of code that creates & calls the Chunker as well...
thanks so much,
/d
This document is very useful for me.
May I ask about some problems I ran into?
When I compiled the above code you shared, some errors happened
bad class file: D:\Tools-N\Java-API\opennlp-tools-1.3.0\lib\opennlp-tools-1.3.0.jar(opennlp/tools/sentdetect/SentenceDetectorME.class)
class file has wrong version 50.0, should be 49.0
Please remove or make sure it appears in the correct subdirectory of the classpath.
SentenceDetectorME sdetector = new SentenceDetector("D:\\Tools-N\\Machine Learning+NLP\\OpenNLP\\sentdect\\EnglishSD.bin.gz");
1. Could you tell me which version of opennlp-tools you used (1.3.0 or 1.0.0)?
2. I got the model files for sentdetect and tokenize (i.e., EnglishSD.bin.gz and EnglishTok.bin.gz), but I am not sure where the correct location for them is. I put them at an absolute path, like "SentenceDetectorME sdetector = new SentenceDetector("D:\\Tools-N\\Machine Learning+NLP\\OpenNLP\\sentdect\\EnglishSD.bin.gz")";
Is that correct?
Sincerely yours
Great job! However I met some problems placing the .gz files, could you release your source code and explain how to run it? thx!
Hi Sophia,
Unfortunately I haven't touched NLP for ages (this blog post is a year old) and I don't have time to put together a sample application. However, the .gz files in my project are in a structure like below. Hope that helps you!
Hi Teng-Kai Fan,
I've posted my version of the OpenNLP library. Not sure if it'll work for your setup so you may have to compile the library yourself. Either way, good luck.
In answer to your second question, I've posted the folder structure in the comment above. Best of luck with the project!
great tutorial thx!!!!
I am having some problems running the code. I'm new to Java and having a lot of problems. After running it I get this message on the console:
java.io.IOException: Not in GZIP format
at java.util.zip.GZIPInputStream.readHeader(Unknown Source)
at java.util.zip.GZIPInputStream.<init>(Unknown Source)
at java.util.zip.GZIPInputStream.<init>(Unknown Source)
at opennlp.maxent.io.SuffixSensitiveGISModelReader.<init>(SuffixSensitiveGISModelReader.java:50)
at opennlp.tools.lang.english.Tokenizer.<init>(Tokenizer.java:36)
at test.Main.main(Main.java:24)
do you know what could be the problem...
Thx for your website.
Without this it would have been impossible to start with OpenNLP - you know I'm completely new to Java.
Hi Daniel, hi guys
As I posted yesterday I'm completely new to Java, so I have some basic problems running OpenNLP tools.
I'm working on a Windows machine.
I'm following all of Daniel's tips step by step, and I also followed Daniel's hint to build my directory tree.
Everything works well but I cannot run the Treebank Parser - I'm using Eclipse: when I run the tool I give one String argument to the TreebankParser, and I write down the path: models\parser.
In this case the console prints out:
""Exception in thread "main" java.io.FileNotFoundException: models\parser\tag.bin.gz (Impossibile trovare il file specificato)""
That means impossible to find the specified file.
At this point I copy the file tag.bin.gz into the models\parser dir, run the tool again, and what I get is:
""Exception in thread "main" java.io.FileNotFoundException: models\parser\head_rules (Impossibile trovare il file specificato)""
And finally, if I put the head_rules dir inside the models\parser dir and run the tool again, it tells me that access to the folder is denied.
What should I do?
Thx
OK I can post and answer by myself
I solved the problem ;)
The only thing I did differently from Daniel was putting the file
tag.bin.gz in the models/parser directory as well, and creating one file called head_rules, without any extension, in the same directory: it works!
Hey,
Soana, you're supposed to download the models to be used in conjunction with the OpenNLP jar file. Just creating the files and not using the models won't be very helpful. Go to http://opennlp.sourceforge.net/models.html to check it out.
Thx quek, anyway I had already downloaded those models: it was just a problem of setting up directories.
Well, now I have a big challenge: I need to understand how to create models for the Italian language, so I need to know which file format the training data should be in to create a new model.
I am looking in a lot of websites but really I cannot find any tip!
So the first simple questions are:
1) Do I need an Italian corpus for training models?
2) If I need a corpus, which style of annotation should I use in the corpus?
Thx a lot if anyone knows ;)
Thanks for the post, very useful.
I wonder how you managed to POS-tag the English sentence. From what I saw on the OpenNLP models download site, there is only a POS tagging model for Spanish. Could you give me a hint?
Hi. I'd like to thank you for the starting push on OpenNLP. My first choice was to use Freeling, but after a little searching, knowing only the name of the OpenNLP project beforehand, I decided to give it a try when I found this article... I don't know yet if I'll stick with OpenNLP or go back to Freeling (it seems the homepage is not working for now...), but this article helped me give it a first try. It seems OpenNLP is very easy to use; now I have to study whether it is as extensive and complete as Freeling (I don't think so, but just a few days won't hurt my project).
Sorry to use your blog for a little discussion, but I think this could help other people, including Katrin, who's asking how to parse english sentences...
This is a little code that does something like that.
You just have to add the imports, replace the three constants at the beginning with real paths, and set the classpath (you need maxent and trove, apart from OpenNLP, of course).
Off the record, just a couple of suggestions: the README file included in the OpenNLP bundle is very helpful, AND the classes mentioned there (PosTagger, SentenceDetector...) are examples of how to use the real utility classes, like the ones used in the code above (POSTaggerME...), so you can just copy-paste code and make it work!
Hope this helps!
Thanks for the space ;)
Edited: fixed the code tag as you suggested, Paler. I also set the language to Java so it's a bit easier to read. For future reference, you can use <code language="java">...</code>. Thanks for posting your code for others!