Introducing fastText
Learn about fastText in this article by Joydeep Bhattacharjee, a Principal Engineer at Nineleaps Technology Solutions who primarily develops intelligent systems that can parse and process data to solve challenging problems at work.
fastText, as a software tool, is an amalgamation of cutting-edge algorithms in natural language processing. It is a library that helps you generate efficient word representations and gives you support for text classification out of the box. fastText claims to be superior at handling words it has not seen before, and it can handle languages for which sufficiently large data sources and corpora may not be available.
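The handling of unseen words is usually explained by fastText representing a word through its character n-grams, so a new word can still share pieces with words seen in training. The following is a minimal sketch of that idea, not the library's implementation: the function name `char_ngrams` is hypothetical, and the n-gram lengths of 3 to 6 and the `<` and `>` boundary markers are assumptions chosen to mirror the library's commonly documented defaults for word-vector training.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in boundary markers.

    Hypothetical helper for illustration; the lengths 3..6 and the
    '<' / '>' markers are assumptions, not read from the library.
    """
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# Trigrams only, to keep the output short.
trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)

# Two related words share several n-grams, which is why a word missing
# from the vocabulary can still be given a meaningful representation.
shared = set(char_ngrams("running")) & set(char_ngrams("runner"))
print(sorted(shared))
```

In this sketch, a word that never appeared in the training data would be represented through n-grams it shares with known words, rather than being treated as entirely unknown.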
In today's interconnected world, a lot of text data is generated. This text includes descriptions of things. Take, for example, people writing about products in Amazon reviews or sharing their thoughts in Facebook posts. Natural Language Processing (NLP) is the application of machine learning and other computational techniques to understanding and representing spoken and written text. The following are the major challenges that NLP seeks to solve:
• Topic modeling: In general, texts deal with a topic. Topic modeling is frequently used to determine hidden structures or "abstract topics" that may be present in a collection of documents. An effective application of topic modeling would be summarization. For example, legal documents are quite complex and verbose, and hence systems such as these would help the reader to get the gist of the document and a high-level description of what is happening.
• Sentence classification: Text classification is an important challenge, where we take in blobs of text and classify them into different labels. For example, a system should be able to correctly classify something like "Shahrukh Khan was on fire at Dubai event" as belonging to the label "Entertainment" and another sentence, "Fire breaks out in the store opposite Breach Candy Hospital," as belonging to the label "News".
• Machine translation: The total number of languages in the world is at least 3,000. About half of these languages have fewer than 10,000 speakers and about 25 percent have fewer than 1,000 speakers. Hence, we can imagine that many languages are dying, and when a language dies, we collectively lose a lot of our cultural heritage. The best translation system right now is made by Google, but it covers only 103 languages at the time of writing. So, it is very important that we develop machine learning translation models that can be trained from a few sources and still have a high degree of predictive power.
• Question and answer (QA) systems: The focus here is to build a system that automatically answers questions that people ask in natural language. QA systems built around a closed domain can be highly accurate, as they can retrieve documents and text that are relevant to the search item.
• Sentiment analysis: Sentiment analysis is about understanding the needs and intents that the users share when talking about something. People make choices based on emotions. The needs of many people are largely emotional and, generally, people are very forthcoming about how they feel. Creating a system that takes this into account will always add a lot of value to the business.
• Event extraction: Many use cases involve large amounts of data stored in the form of text. For example, some legal text may describe a "crime" event, which is followed by an "investigation" event, which is followed by multiple "hearing" events. The events themselves may be nested, such that a "hearing" event may consist of a "presenting arguments" event and a "presenting evidence" event.
• Named entity detection: The focus of building this system is to extract and classify entities or specific information according to predefined categories, such as people, organizations, geography, and so on. For example, if we take the following text: "We're used to spicy foods down here in South Texas," we can understand that the "buyer" likes "spicy foods" and their "geography" is South Texas. If the data provides sufficient evidence that buyers in South Texas like spicy foods, more such foods can be marketed to them.
• Relation detection: A relation detection system parses text and identifies focal points and agents, then tries to find the relationship between them. For example, the sentence "Mike has the flu" can be converted to Person-[RELATION:HAS]->Disease. These relations can then be explored in a business context to build intelligent apps.
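To make the sentence classification item concrete, here is a minimal sketch of how the two example sentences could be written as training lines for fastText's supervised mode, which reads one example per line with each label marked by a prefix (`__label__` is the library's default). The helper name `to_fasttext_line` is hypothetical, and collapsing whitespace is an assumption made only to keep each example on a single line.

```python
def to_fasttext_line(text, labels, prefix="__label__"):
    """Format one example as '<prefix>Label ... text' on a single line.

    Hypothetical helper for illustration; it only builds strings and
    does not call the fastText library.
    """
    label_part = " ".join(f"{prefix}{label}" for label in labels)
    # Collapse runs of whitespace (including newlines) into single spaces.
    clean_text = " ".join(text.split())
    return f"{label_part} {clean_text}"


examples = [
    ("Shahrukh Khan was on fire at Dubai event", ["Entertainment"]),
    ("Fire breaks out in the store opposite Breach Candy Hospital", ["News"]),
]

lines = [to_fasttext_line(text, labels) for text, labels in examples]
print("\n".join(lines))
```

Lines in this shape can be saved to a plain text file and passed to the supervised training command; real training data would need many more examples per label.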
The list above covers many of the problems that NLP practitioners are targeting. Depending on the use case, you can pick any of these challenges and try to solve it in your domain. The difficulty with many previous approaches and modeling techniques is that NLP requires a lot of textual data, and that data carries a lot of contextual information. It is quite hard for a computational model to make sense of all the data efficiently.
NLP models up to now have largely targeted English, because that is the language in which textual data is most readily available. But only 20 percent of the world's population speaks English, and even among them, the majority are non-native speakers. The biggest deterrent to building non-English NLP models is the lack of data. Hence, we desperately need libraries that can build models even when the data is limited. fastText has the potential to change all that. The fastText team has published pre-trained word vectors for 294 languages. By the time the book is published, more languages will have been added.
fastText is lightweight and does not have huge software or hardware needs. Unlike many other machine learning tools, it does not need massive GPU clusters; fastText runs on the CPU. You can compress the models to 1-2 MB and load them on small devices such as mobile phones or a Raspberry Pi. It runs on all popular operating systems, such as Linux, macOS, and Windows.
However, fastText comes with some of its own challenges:
• The algorithms in fastText are cutting edge, and developers might not be willing to transition to the new algorithms
• fastText is primarily command-line based, and not many researchers are command-line savvy
• There is a lack of integration tools with other popular machine learning tools
Looking into the future, we know that word vectors are an amazingly powerful concept and a technology that will enable significant breakthroughs in NLP applications and research. They highlight the power of learned representations of input data in hidden layers. Building better NLP applications inevitably needs a good understanding of word vectors, and fastText will greatly help in fostering this understanding.
If you found this article interesting, you can explore fastText Quick Start Guide to learn how to perform efficient text representation and classification with Facebook's fastText library. fastText Quick Start Guide is an ideal introduction to fastText. You can learn how to create fastText models from the command line, without the need for complicated code.