
Java Regular Expression: part 7 - Extracting text with java.util.Scanner

Published Jan 22, 2019. Last updated Apr 05, 2019.

Hello, and welcome back. In this part, I’m going to show you how we can use the Scanner class in the package java.util to extract text from a string.
Before delving into examples, let’s discuss the two concepts of tokens and delimiters.
In a string, there are 2 types of text: tokens and delimiters. Tokens are meaningful words, and delimiters are characters that separate tokens.
For instance, I have a string: I love you so much.
So, in the string:
tokens: I, love, you, so, and much
delimiters: whitespace characters.
However, what counts as a token and what counts as a delimiter depends largely on our purpose. For instance, we can use whitespace as the delimiter, or we can specify any other characters to work as delimiters. Several of Java's tokenizing classes, including Scanner and StringTokenizer, use whitespace as the default delimiter.
In this session, we will use the Scanner class to extract text based on particular delimiters, or based on a pattern.
Very often, we use the Scanner class to read user input from the console by passing System.in to its constructor.
In fact, a Scanner can take its input from System.in, from a String, or from a file.
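To make the different input sources concrete, here is a minimal sketch that reads the same text once from a String and once from a file. The class name ScannerSourcesDemo and its helper methods are made up for this illustration, and the temporary file exists only so that the example is self-contained.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerSourcesDemo {

    // Collects every whitespace-separated token the given Scanner can produce.
    static List<String> readAll(Scanner sc) {
        List<String> tokens = new ArrayList<>();
        while (sc.hasNext()) {
            tokens.add(sc.next());
        }
        return tokens;
    }

    // Source 1: a String.
    static List<String> fromString(String text) {
        try (Scanner sc = new Scanner(text)) {
            return readAll(sc);
        }
    }

    // Source 2: a file. The temporary file is an assumption of this sketch;
    // in a real program you would pass the file you actually want to read.
    static List<String> fromTempFile(String text) {
        try {
            File file = File.createTempFile("scanner-demo", ".txt");
            try {
                Files.write(file.toPath(), text.getBytes(StandardCharsets.UTF_8));
                try (Scanner sc = new Scanner(file, "UTF-8")) {
                    return readAll(sc);
                }
            } finally {
                file.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fromString("I love you"));   // prints [I, love, you]
        System.out.println(fromTempFile("I love you")); // prints [I, love, you]
        // Source 3 would be new Scanner(System.in), which waits for console input.
    }
}
```

Whatever the source is, the token-reading code stays the same; only the constructor argument changes.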
The first method we can use to split a string into tokens is the next() method in the Scanner class, which uses whitespace as the default delimiter.
Let’s see some code:

import java.util.Scanner;
public class Demo {
    public static void main(String[] args) {
        Scanner sc;
        String s = "I love you so much. I want to marry you";
        sc = new Scanner(s);
        while (sc.hasNext()) {
            String token = sc.next();
            System.out.println(token);
        }
    }
}

In the above code, an instance of Scanner was created first. Instead of passing System.in to the constructor, I passed the string variable s, because the text we want to read is in that string:

String s = "I love you so much. I want to marry you";
sc = new Scanner(s);

Next, I used a while loop to traverse all the tokens in the string. In the while loop’s condition, the hasNext() method was invoked to check whether there was another token:

while (sc.hasNext())

The hasNext() method returns true if there is at least one more token; otherwise it returns false, which also indicates that the end of the string has been reached.
By default, the Scanner uses whitespace as the delimiter that separates tokens.
Note that the hasNext() method does not consume any token. It only looks ahead to check whether another token is available, so calling it repeatedly never advances through the string.
The method that actually reads and returns a token is next(). Each call returns the next token and stops at the following delimiter, and the loop repeats until there are no more tokens:

String token = sc.next();

Run the program and we will get the following output:

I
love
you
so
much.
I
want
to
marry
you
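The difference between hasNext() and next() can be checked with a small experiment. This is only a sketch; the class HasNextDemo and its method names are invented for the illustration.

```java
import java.util.NoSuchElementException;
import java.util.Scanner;

public class HasNextDemo {

    // Calls hasNext() several times, then next() once, and returns that token.
    // If hasNext() consumed tokens, the result would not be the first word.
    static String firstTokenAfterRepeatedChecks(String text) {
        try (Scanner sc = new Scanner(text)) {
            sc.hasNext();
            sc.hasNext();
            sc.hasNext(); // the Scanner is still positioned before the first token
            return sc.next();
        }
    }

    // Shows why the loop checks hasNext() first: calling next() when no token
    // is left throws NoSuchElementException.
    static boolean nextThrowsWhenExhausted() {
        try (Scanner sc = new Scanner("")) {
            if (sc.hasNext()) {
                return false; // not expected for an empty string
            }
            sc.next();
            return false;
        } catch (NoSuchElementException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(firstTokenAfterRepeatedChecks("I love you")); // prints I
        System.out.println(nextThrowsWhenExhausted());                   // prints true
    }
}
```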

The string was split at every space because, as mentioned earlier, a Scanner uses whitespace as its delimiter by default.
However, we can tell the Scanner to use any characters as delimiters.
For instance, now I want to use both the space and the dot (.) characters as delimiters.
That can be done like this:

import java.util.Scanner;
public class Demo {
    public static void main(String[] args) {
        Scanner sc;
        String s = "I love you so much. I want to marry you";
        sc = new Scanner(s);
        sc.useDelimiter("[ .]");
        while (sc.hasNext()) {
            String token = sc.next();
            System.out.println(token);
        }
    }
}

In the code, I have added the following method call:

sc.useDelimiter("[ .]");

The useDelimiter() method tells the Scanner what to use as delimiters. Its argument is a regular expression, and here the character class [ .] matches a single space or a single dot (inside square brackets, the dot is a literal character).
Note that when we set custom delimiters with the useDelimiter() method, whitespace is no longer used by default. Therefore, if you still want whitespace to act as a delimiter, you need to include it explicitly, as we just did.
Now it’s time to run the program:

I
love
you
so
much

I
want
to
marry
you

In the output, you can see an empty line. That’s because we used both the space and the dot as delimiters, and each match of [ .] covers exactly one character. Between the words much and I, the dot is immediately followed by a space, so the Scanner finds an empty token between those two delimiters.
If we want to treat two or more adjacent delimiter characters as a single delimiter, we need to add the + quantifier as follows:

sc.useDelimiter("[ .]+");

Run the program again:

I
love
you
so
much
I
want
to
marry
you

And the empty line has been removed.
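To see the empty token directly rather than as a blank line, the tokens can be collected into a list. The following sketch uses a made-up helper, tokens(), which takes the delimiter as a parameter so that both patterns can be compared on the same input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class EmptyTokenDemo {

    // Splits text with the given delimiter regex and returns all tokens,
    // including empty ones.
    static List<String> tokens(String text, String delimiterRegex) {
        List<String> result = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            sc.useDelimiter(delimiterRegex);
            while (sc.hasNext()) {
                result.add(sc.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "much. I";
        // One character per delimiter: an empty token sits between "." and " ".
        System.out.println(tokens(s, "[ .]"));  // prints [much, , I]
        // One or more characters per delimiter: ". " is a single delimiter.
        System.out.println(tokens(s, "[ .]+")); // prints [much, I]
    }
}
```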
So far the delimiter has been a simple character class, but the useDelimiter() method accepts any regular expression.
Suppose I have the following string:

I love you 4 so much. 34 I 23 want to marry you

There are digits in the string and I want to break the string into substrings based on those digits.
I can achieve the task as follows:

import java.util.Scanner;
public class Demo {
    public static void main(String[] args) {
        Scanner sc;
        String s = "I love you 4 so much. 34 I 23 want to marry you";
        sc = new Scanner(s);
        sc.useDelimiter("\\d+");
        while (sc.hasNext()) {
            String token = sc.next();
            System.out.println(token);
        }
    }
}

As you can see, I have used a digit pattern as the argument of the useDelimiter() method:

sc.useDelimiter("\\d+");

Also notice that we need the plus (+) sign so that a run of adjacent digits, such as 34, is treated as a single delimiter.
Run the program and we will have:

I love you 
 so much. 
 I 
 want to marry you
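Notice that the tokens keep the spaces that surrounded the numbers, because only the digits were consumed as delimiters. One possible way to avoid that, sketched below, is to let the delimiter also swallow optional whitespace on both sides with \s*. The class and method names here are invented for the example; calling trim() on each token would be another option.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class TrimmedDelimiterDemo {

    // Splits on runs of digits together with any whitespace around them.
    static List<String> splitOnNumbers(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            sc.useDelimiter("\\s*\\d+\\s*");
            while (sc.hasNext()) {
                result.add(sc.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "I love you 4 so much. 34 I 23 want to marry you";
        for (String token : splitOnNumbers(s)) {
            // Brackets make any leftover spaces visible.
            System.out.println("[" + token + "]");
        }
    }
}
```

With this delimiter the program prints [I love you], [so much.], [I], and [want to marry you], each on its own line and without leading or trailing spaces.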

With the same string, we can also do the reverse and retrieve all the numbers: 4, 34, and 23. That means every character except digits acts as a delimiter.
To achieve this, all we need to do is make a minor change in the pattern:

import java.util.Scanner;
public class Demo {
    public static void main(String[] args) {
        Scanner sc;
        String s = "I love you 4 so much. 34 I 23 want to marry you";
        sc = new Scanner(s);
        sc.useDelimiter("[^\\d]+");
        while (sc.hasNext()) {
            String token = sc.next();
            System.out.println(token);
        }
    }
}

As you can see, I have put the caret sign (^) right before \d inside the square brackets. As you may remember, a caret at the start of a character class negates it, so [^\d] matches any character that is not a digit.
So, the pattern means that any run of non-digit characters is used as a delimiter.
Run the program and we get:

4
34
23
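Since the remaining tokens are all numbers, they can also be read as int values instead of strings. The sketch below is one way to do that with hasNextInt() and nextInt(); the class name and helper methods are hypothetical, and the delimiter \D+ is used as a shorter equivalent of [^\d]+.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class NumberExtractDemo {

    // Returns every run of digits in the text as an int.
    static List<Integer> extractNumbers(String text) {
        List<Integer> numbers = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            sc.useDelimiter("\\D+"); // any run of non-digit characters
            while (sc.hasNextInt()) {
                numbers.add(sc.nextInt());
            }
        }
        return numbers;
    }

    // Adds up the extracted numbers.
    static int sum(List<Integer> numbers) {
        int total = 0;
        for (int n : numbers) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) {
        String s = "I love you 4 so much. 34 I 23 want to marry you";
        List<Integer> numbers = extractNumbers(s);
        System.out.println(numbers);      // prints [4, 34, 23]
        System.out.println(sum(numbers)); // prints 61
    }
}
```

Keep in mind that nextInt() throws an exception for a run of digits too large to fit in an int, so for arbitrary input the string-based loop shown earlier is the safer choice.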


--

Visit learnbyproject.net for free Regular Expression courses and other free courses

Discover and read more posts from Sera.Ng