R : word frequency in a dataframe
Alright, so in this short tutorial we'll calculate word frequency and visualize it.
It's a relatively simple task.
BUT when it comes to stopwords and a language other than English, there can be some difficulties.
I have a dataframe with a text field in Russian.
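If you want to follow along without my data, here is a tiny stand-in dataframe (the name video and the field text match what's used below; the rows themselves are made up):
# hypothetical sample data, just so the pipeline below has something to chew on
# rows mean: "this is the first video", "this is the second video about cats"
video <- data.frame(
  text = c("это первое видео", "это второе видео про котов"),
  stringsAsFactors = FALSE
)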
Step 0 : Install required libraries
install.packages("tidyverse")
install.packages("tidytext")
install.packages("tm")
library(tidyverse)
library(tidytext)
library(tm)
Step 1 : Create stopwords dataframe
#create stopwords DF
rus_stopwords = data.frame(word = stopwords("ru"))
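If the standard list isn't enough (there is always some project-specific junk), you can append your own words to it; the two words here are just placeholders:
# optional: extend the stopword list with custom words (placeholders)
rus_stopwords <- rbind(rus_stopwords, data.frame(word = c("word1", "word2")))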
Step 2 : Tokenize
new_df <- video %>% unnest_tokens(word, text) %>% anti_join(rus_stopwords, by = "word")
# unnest_tokens - splits text into one lowercase word per row
# anti_join - function to remove stopwords (by = "word" just silences the join message)
# video - the name of our dataframe
# word - the name of the new field
# text - the field that holds our text
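A quick way to check that the join really removed the stopwords:
# sanity check: should print FALSE if no stopwords survived
any(new_df$word %in% rus_stopwords$word)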
Step 3 : Count words
frequency_dataframe = new_df %>% count(word) %>% arrange(desc(n))
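Side note: count() can sort for you, so the same result in one call:
# equivalent shorthand - count() has a built-in sort argument
frequency_dataframe = new_df %>% count(word, sort = TRUE)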
Step 4 (Optional) : Take only the first 20 rows of the dataframe
short_dataframe = head(frequency_dataframe, 20)
Step 5 : Visualize with ggplot
ggplot(short_dataframe, aes(x = word, y = n, fill = word)) + geom_col()
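One catch: geom_col() plots the words in alphabetical order, not by frequency. If you want the bars sorted (and readable labels for long words), a reorder() + coord_flip() variant like this should do it:
# sort bars by frequency and flip axes so the word labels stay readable
ggplot(short_dataframe, aes(x = reorder(word, n), y = n, fill = word)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(x = "word", y = "count")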
So in my case the basic chart looked like this: