II - Tokenization using Apache OpenNLP

Ankit Agrahari
Nov 14, 2021
3 min read

In this post we will split a given string into tokens using the Apache OpenNLP library.

This post is a continuation of the previous post, Find Sentences using Apache Open NLP. Refer to that post first for setting up your project with the Apache OpenNLP library. In the previous post we used the SentenceDetectorME, SentenceModel and Span classes of the OpenNLP library.

The Apache OpenNLP library provides SimpleTokenizer, WhitespaceTokenizer and TokenizerME. The first two tokenize with fixed rules (splitting on character classes or on whitespace), while TokenizerME uses an already trained model. SimpleTokenizer and WhitespaceTokenizer are used through their static INSTANCE field rather than a constructor; TokenizerME is constructed from a TokenizerModel.

SimpleTokenizer

This tokenizer splits the text into individual words and punctuation marks such as exclamation marks, periods and commas. It provides two methods to tokenize a given string. We will try both.

The first step is to get the instance of the SimpleTokenizer:

//Get the Simple Tokenizer Instance
SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;

Call the tokenize() method on this object, passing the string as input.

//Call the tokenize method
String[] tokens = tokenizer.tokenize(text);

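Print the tokens. Arrays.toString() writes them as one bracketed, comma-separated line; the tokens are shown below in columns (read left to right) for readability.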

//Print the tokens
System.out.println(Arrays.toString(tokens));

Hi              .               How
are             you             ?
Hope            everything      is
going           well            .
Welcome         to              ChatBot
Lesson1         .               We
will            try             to
understand      Open            Apache
NLP             for             sentence
detection
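To make the "words and punctuation become separate tokens" behaviour concrete, here is a rough plain-Java approximation. It is only a sketch and not OpenNLP's algorithm: the class name PunctuationSplitSketch, the regular expression and the sample string are all assumptions made for illustration (the post does not show its input string).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP: approximates the idea that
// words and punctuation marks end up as separate tokens.
class PunctuationSplitSketch {

    // Either a run of letters/digits, or one single character that is
    // neither a letter, a digit, nor whitespace.
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}]+|[^\\p{L}\\p{N}\\s]");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Assumed sample input, not taken from the post.
        System.out.println(tokenize("Hi. How are you?"));
        // prints [Hi, ., How, are, you, ?]
    }
}
```

The real SimpleTokenizer may treat some inputs differently from this regular expression, so use the sketch only to build intuition.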

There is another method, tokenizePos(), which returns the positions of the tokens as an array of Span objects. Each Span is printed as a half-open range [start..end), where the end offset is exclusive.

Span[] spans = tokenizer.tokenizePos(text);
for(Span s: spans){
    System.out.println(s);
}

The output of the above statements begins like this (first six spans shown):

[0..2)
[2..3)
[4..7)
[8..11)
[12..15)
[15..16)
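The offsets above are character positions in the input string, so each range can be turned back into its token with substring(start, end). The following plain-Java sketch shows this without OpenNLP; the class name SpanSliceSketch and the sample string are assumptions (the string was chosen to be consistent with the six offsets printed above). With the real Span class, getStart(), getEnd() and getCoveredText(text) serve the same purpose.

```java
// Plain-Java sketch; the int pairs mirror the half-open [start..end) ranges.
class SpanSliceSketch {

    // Returns the characters from start (inclusive) to end (exclusive).
    static String slice(String text, int start, int end) {
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        // Assumed input, consistent with the offsets shown in the post.
        String text = "Hi. How are you?";
        int[][] spans = { {0, 2}, {2, 3}, {4, 7}, {8, 11}, {12, 15}, {15, 16} };
        for (int[] s : spans) {
            System.out.println("[" + s[0] + ".." + s[1] + ") -> " + slice(text, s[0], s[1]));
        }
    }
}
```

Because the end offset is exclusive, end minus start is the token length, and adjacent tokens such as "Hi" [0..2) and "." [2..3) share a boundary without overlapping.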

WhitespaceTokenizer

This tokenizer splits the sentence on whitespace. It also has two methods: tokenize(), which returns the tokens as a string array, and tokenizePos(), which returns an array of Span objects.

Here is the code to get the WhitespaceTokenizer instance and print the tokens and their positions.

//Get the Whitespace Tokenizer Instance
WhitespaceTokenizer tokenizer = WhitespaceTokenizer.INSTANCE;

//Call the tokenize method, this will split based on spaces
String[] tokens = tokenizer.tokenize(text);

//Print the tokens
for(String s: tokens){
    System.out.println(s);
}

Span[] spans = tokenizer.tokenizePos(text);
for(Span s: spans){
    System.out.println(s);
}

The output will be like this:

Hi.            How               are
you?           Hope              everything
is             going             well.
Welcome        to                ChatBot
Lesson1.       We                will
try            to                understand
Open           Apache            NLP
for            sentence          detection

[0..3)           [4..7)              [8..11)
[12..16)         [17..21)            [22..32)
[33..35)         [36..41)            [42..47)
[48..55)         [56..58)            [59..66)
[67..75)         [76..78)            [79..83)
[84..87)         [88..90)            [91..101)
[102..106)       [107..113)          [114..117)
[118..121)       [122..130)          [131..140)

The implementation is similar to the SimpleTokenizer.
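As a cross-check of the offsets above, whitespace tokenization with positions can be sketched in plain Java by matching runs of non-whitespace characters. This is an illustrative approximation, not OpenNLP's implementation; the class name WhitespaceOffsetsSketch and the sample string are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP.
class WhitespaceOffsetsSketch {

    // One token is one maximal run of non-whitespace characters.
    private static final Pattern NON_SPACE = Pattern.compile("\\S+");

    // Returns {start, end} pairs, with end exclusive.
    static List<int[]> offsets(String text) {
        List<int[]> result = new ArrayList<>();
        Matcher m = NON_SPACE.matcher(text);
        while (m.find()) {
            result.add(new int[] { m.start(), m.end() });
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample input, not taken from the post.
        String text = "Hi. How are you?";
        for (int[] o : offsets(text)) {
            System.out.println("[" + o[0] + ".." + o[1] + ") " + text.substring(o[0], o[1]));
        }
    }
}
```

Note that punctuation stays attached to the preceding word here ("Hi." covers [0..3)), which is the visible difference from the SimpleTokenizer output.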

TokenizerME

This class creates tokens based on a trained model. The model file is first loaded into a TokenizerModel, which is then passed to the TokenizerME constructor.

Load the trained model into a TokenizerModel object. I have used an existing pre-trained model, en-token.bin, made available by the Apache OpenNLP project.

//Load the token models
InputStream is = new FileInputStream("src\\main\\resources\\models\\en-token.bin");
TokenizerModel model = new TokenizerModel(is);

//Create the tokenizer from the loaded model
TokenizerME tokenizerME = new TokenizerME(model);
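The model path above uses Windows-style backslashes, which will not resolve on Linux or macOS. A possible alternative, sketched here with only the Java standard library, builds the same path from segments and opens the file with try-with-resources so the stream is always closed. The class name ModelPathSketch is hypothetical, and the OpenNLP call is left as a comment so the sketch runs without the library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of OpenNLP.
class ModelPathSketch {

    // Same location as in the post, built from segments instead of "\\".
    static Path modelPath() {
        return Paths.get("src", "main", "resources", "models", "en-token.bin");
    }

    public static void main(String[] args) throws IOException {
        Path path = modelPath();
        if (!Files.exists(path)) {
            System.out.println("Model file not found: " + path.toAbsolutePath());
            return;
        }
        // try-with-resources closes the stream even if reading fails.
        try (InputStream in = Files.newInputStream(path)) {
            // In the post's code, this stream would be passed to
            // new TokenizerModel(in).
            System.out.println("Opened model file, bytes available: " + in.available());
        }
    }
}
```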

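TokenizerME has three methods: tokenize(), tokenizePos() and getTokenProbabilities(). The last one returns the probabilities for the tokens produced by the most recent tokenize() or tokenizePos() call. They can be used like this: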

//Print the tokens
for(String tokens: tokenizerME.tokenize(text)){
    System.out.println(tokens);
}

//Get the position of the tokens
Span[] spans = tokenizerME.tokenizePos(text);
for(Span span: spans){
    System.out.println(span);
}

//Get the token probabilities
double[] prob = tokenizerME.getTokenProbabilities();
for(double s: prob){
    System.out.println(s);
}

The tokens and spans printed by the above code will be like this (the probability values are not shown):


Hi              .               How 
are             you             ?
Hope            everything      is
going           well            .
Welcome         to              ChatBot
Lesson1         .               We
will            try             to
understand      Open            Apache 
NLP             for             sentence
detection

[0..2)      [2..3)      [4..7)
[8..11)     [12..15)    [15..16)
[17..21)    [22..32)    [33..35)
[36..41)    [42..46)    [46..47)
[48..55)    [56..58)    [59..66)
[67..74)    [74..75)    [76..78)
[79..83)    [84..87)    [88..90)
[91..101)   [102..106)  [107..113)
[114..117)  [118..121)  [122..130)
[131..140)

You can find the code in the GitHub repository.

This concludes tokenization using the Apache OpenNLP library in Java. It is another step towards using natural language processing.

Also, please leave a comment with your feedback or any suggestions.
