
Java Regular Expression: part 8 - Extracting text with String.split()

Published Apr 05, 2019

The String class gives developers another option for splitting a string into words based on certain delimiters: the split() method.
The split() method can break up a string into tokens using certain delimiters, just like the Scanner class we came across in the previous part.
However, there are 2 main differences from the Scanner class:

  • The split() method has no default delimiter, whereas the Scanner class splits on whitespace by default. Therefore, if we wish to use whitespace characters as delimiters, we need to specify them explicitly.
  • The split() method returns a string array containing the extracted tokens. This is very convenient if we plan to process those tokens later on.

Let’s see an example:

public class Demo {
    public static void main(String[] args){
        String[] tokens;
        String s = "I love you so much! But I cannot marry you.";
        tokens = s.split("[ ]");
        for (String token : tokens) {
            System.out.println(token);
        }
    }
}

In the program, I have the following string:

String s = "I love you so much! But I cannot marry you.";

I want to break the string into substrings, or tokens, at each space character. I can achieve the task as follows:

tokens = s.split("[ ]");

Since the split() method returns an array of extracted tokens, we need a loop or something similar to read those tokens:

for (String token : tokens) {
    System.out.println(token);
}

Run the program and we get the following output:

I
love
you
so
much!
But
I
cannot
marry
you.
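
To make the comparison with the Scanner class concrete, here is a minimal sketch that tokenizes the same string both ways. The class name ScannerVsSplit and the helper methods scanTokens() and splitTokens() are names I made up for illustration. It assumes the words are separated by single spaces, as in our string, in which case both approaches produce the same 10 tokens:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class, not part of the original series.
public class ScannerVsSplit {

    // Scanner uses whitespace as its default delimiter,
    // so no pattern needs to be supplied here.
    static List<String> scanTokens(String s) {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(s)) {
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    // split() has no default delimiter, so the pattern is passed explicitly.
    static List<String> splitTokens(String s) {
        return Arrays.asList(s.split("[ ]"));
    }

    public static void main(String[] args) {
        String s = "I love you so much! But I cannot marry you.";
        System.out.println("Scanner: " + scanTokens(s));
        System.out.println("split(): " + splitTokens(s));
    }
}
```

With Scanner we have to collect the tokens ourselves, while split() hands back the whole array in one call.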

If you want to specify more characters as delimiters, list them inside the square brackets. Below, I use both the space and the exclamation mark (!):

tokens = s.split("[ !]");

The complete example:

public class Demo {
    public static void main(String[] args){
        String[] tokens;
        String s = "I love you so much! But I cannot marry you.";
        tokens = s.split("[ !]");
        for (String token : tokens) {
            System.out.println(token);
        }
        System.out.println("Number of tokens: " + tokens.length);
    }
}

Note that in the above example, I also printed the length of the tokens array, which is the number of extracted tokens.
Here is the output:

I
love
you
so
much

But
I
cannot
marry
you.
Number of tokens: 11

Notice the blank line in the output and the count of 11. That’s because I used both the space and the exclamation mark as delimiters, and at one point in the string these two characters appear right next to each other. The split() method returns the empty string between them as an empty token.
If we want to remove the empty token, which means treating adjacent delimiters as one, we just need to add the plus sign (+) at the end of the pattern, like below:

tokens = s.split("[ !]+");

Run the program and we have the following result:

I
love
you
so
much
But
I
cannot
marry
you.
Number of tokens: 10

And as you can see, we now have only 10 tokens.
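
The empty token is easier to spot when the whole array is printed on one line. The sketch below is only an illustration: the class name EmptyTokenDemo and the two helper methods are hypothetical, and it uses Arrays.toString() instead of the loop from the earlier examples:

```java
import java.util.Arrays;

// Hypothetical demo class comparing the two patterns side by side.
public class EmptyTokenDemo {

    // Every single space or exclamation mark is a delimiter of its own.
    static String[] splitEach(String s) {
        return s.split("[ !]");
    }

    // A run of adjacent spaces and exclamation marks counts as one delimiter.
    static String[] splitRuns(String s) {
        return s.split("[ !]+");
    }

    public static void main(String[] args) {
        String s = "I love you so much! But I cannot marry you.";

        // prints [I, love, you, so, much, , But, I, cannot, marry, you.]
        System.out.println(Arrays.toString(splitEach(s)));

        // prints [I, love, you, so, much, But, I, cannot, marry, you.]
        System.out.println(Arrays.toString(splitRuns(s)));
    }
}
```

In the first line of output, the empty token shows up as nothing between the two commas after "much".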
Apart from listing specific characters as delimiters, we can use other regular expression constructs, such as predefined character classes, in the pattern we pass to the split() method.
Let’s see the following program:

public class Demo {
    public static void main(String[] args){
        String[] tokens;
        String s = "I love you 4 so much. 34 I 23 want to marry you";
        tokens = s.split("[\\s\\d]+");
        for (String token : tokens) {
            System.out.println(token);
        }
        System.out.println("Number of tokens: " + tokens.length);
    }
}

In the program, I have the following string:

String s = "I love you 4 so much. 34 I 23 want to marry you";

I want to retrieve the tokens using digits and whitespace characters as delimiters. I can call the split() method as follows:

tokens = s.split("[\\s\\d]+");

In the pattern, I have:

  • \s: represents whitespace characters. Keep in mind that whitespace characters include: space, tab (\t), newline or line feed (\n), vertical tab (\x0B), form feed (\f), and carriage return (\r)
  • \d: represents digits, as you should be familiar with already

Run the program and we have:

I
love
you
so
much.
I
want
to
marry
you
Number of tokens: 10

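As you can see in the output, we have 10 tokens in total, which means there was no empty token: thanks to the plus sign, each run of adjacent whitespace characters and digits is treated as a single delimiter.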
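
Since \s covers more than the plain space, the same pattern also works when the words are separated by tabs or newlines. Here is a small sketch to try it out. The class name WhitespaceClassDemo and the tokenize() helper are hypothetical, and the input is our string with one space replaced by a tab and another by a newline:

```java
// Hypothetical demo class: \s matches tabs and newlines, not just spaces.
public class WhitespaceClassDemo {

    // Splits on runs of whitespace characters and digits.
    static String[] tokenize(String s) {
        return s.split("[\\s\\d]+");
    }

    public static void main(String[] args) {
        // Same sentence as before, but with a tab (\t) and a newline (\n) mixed in.
        String s = "I love\tyou 4 so\nmuch. 34 I 23 want to marry you";
        String[] tokens = tokenize(s);
        for (String token : tokens) {
            System.out.println(token);
        }
        System.out.println("Number of tokens: " + tokens.length);
    }
}
```

Had I used "[ \\d]+" instead, the tab and the newline would have stayed inside the tokens.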

--

Visit learnbyproject.net for a free Regular Expression course and other free courses

Discover and read more posts from Sera.Ng