The purpose of this project is to take a legal document, such as a contract, model its topics, and create a pipeline that tags parts of the document with relevant labels. This notebook focuses on preprocessing the data, topic modeling, and creating the training set. Ultimately the code in this repo will be useful for people who want to understand a complex legal document, such as a credit card agreement, more clearly.
The data comes from the following link: https://www.consumerfinance.gov/credit-cards/agreements/
The Consumer Financial Protection Bureau (CFPB) collects credit card agreements from creditors on a quarterly basis and posts them at the link above. The CFPB organizes the data into one directory per participating company, with all of that company's agreements collected inside it. For Q4 of 2018 there are 652 companies, and each company has on average 2-4 agreements.
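A minimal sketch of walking that one-directory-per-company layout with `pathlib`. The directory and file names below are illustrative stand-ins, not the actual CFPB names; here we build a tiny mock tree so the snippet is self-contained.

```python
from pathlib import Path
import tempfile

# Build a tiny stand-in for the CFPB layout: one directory per company,
# each holding that company's agreement PDFs (names are made up).
root = Path(tempfile.mkdtemp()) / "agreements"
mock_layout = {
    "Acme Bank": ["card_agreement_1.pdf", "card_agreement_2.pdf"],
    "First Example CU": ["agreement.pdf"],
}
for company, files in mock_layout.items():
    d = root / company
    d.mkdir(parents=True)
    for f in files:
        (d / f).touch()

# Walk the layout: one entry per (company, agreement) pair.
pairs = [(p.parent.name, p.name) for p in sorted(root.glob("*/*.pdf"))]
print(len(pairs))  # 3 agreements across 2 companies
```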
For most people, contract documents are not fun to read: they are usually written in complex legal jargon, and the style of writing is purposely dry so as to spell out worst-case scenarios. That said, it is important to understand what you or your business is getting into before signing any sort of agreement. Because it takes a certain expertise to understand these documents, it would be interesting to see whether we can leverage natural language processing techniques to tag them.
This repo will enable you to input a credit card agreement PDF and output labeled sections of the document, making it easier to read. Please see example.ipynb for a walkthrough on how to use this repo.
The notebook contract_reader.ipynb has further details on how the repo is constructed.
Overall the dataset contains over 200K headlines from the Huffington Post between 2012 and 2018. The dataset has six columns that capture the category, headline, author, link, description, and date the article was published. There are 40 different categories, ranging from politics to education; the top categories are politics, wellness, and entertainment. For the purposes of this notebook we won't be using the other columns, but it is worth noting that each date may have more than one headline. More information about the data can be found below this abstract.
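To make the schema concrete, here is a toy stand-in with the six columns described above (the column names are assumptions based on the common distribution of this dataset; the real file, typically JSON lines, would be loaded with `pd.read_json(..., lines=True)`):

```python
import pandas as pd

# Toy rows mimicking the six-column schema; values are invented.
df = pd.DataFrame({
    "category": ["POLITICS", "WELLNESS", "POLITICS", "ENTERTAINMENT"],
    "headline": ["Headline A", "Headline B", "Headline C", "Headline D"],
    "authors": ["x", "y", "x", "z"],
    "link": ["url1", "url2", "url3", "url4"],
    "short_description": ["...", "...", "...", "..."],
    "date": ["2018-05-26", "2018-05-26", "2018-05-25", "2018-05-24"],
})

# Top categories by headline count; note that one date carries two headlines.
counts = df["category"].value_counts()
print(counts.head())
```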
The goal of the notebook will be to take the headline column and use topic modeling to recreate the categories. Since we already have hand-labeled category information, it will be interesting to see if our models match the ground truth data that we have. To accomplish this we will use non-negative matrix factorization (NMF) to 1) choose the optimal number of topics and 2) associate documents and terms with those topics. NMF is explained in further detail below, but in short it decomposes a document-term matrix into two non-negative factors from which you can read off document-topic and topic-term associations.
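A minimal sketch of that decomposition using scikit-learn's `NMF` on a toy document-term matrix (the matrix values and k=2 are invented for illustration):

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy document-term matrix V: 4 documents x 5 terms (raw counts).
V = np.array([
    [3, 0, 1, 0, 0],
    [2, 0, 0, 1, 0],
    [0, 4, 0, 0, 2],
    [0, 3, 0, 0, 1],
], dtype=float)

# Factor V ~ W @ H with k=2 topics:
#   W (4x2) holds document-topic weights, H (2x5) holds topic-term weights.
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(V)
H = model.components_
print(W.shape, H.shape)
```

Because both factors are constrained to be non-negative, large entries in a row of W read directly as "this document loads on this topic," and large entries in a row of H as "this term characterizes this topic."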
The project will proceed in multiple stages: 1) preprocess the text; 2) create a document-term matrix using tf-idf; 3) fit an NMF model to the doc-term matrix; 4) select the optimal number of topics by calculating topic coherence with word2vec; 5) for the optimal k, print the top terms and documents per topic and compare against the original labels for accuracy.