Document Summarization in Python
Hello Coders!
This article explains how to process documents using Python. Analyzing the content of a Microsoft Word document, extracting its text, and creating a summary is a multi-step process, and Python provides libraries that help with each step.
Here's a general approach:
✅ Install Libraries
You'll need the docx2txt
library to extract text from Word documents and the nltk
library for text summarization. Install them using pip
:
pip install python-docx nltk
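The nltk tokenizers and stop-word list depend on data files that are downloaded separately from the package itself. A one-time download step, sketched below, is usually needed before the tokenization code will run:
import nltk

# One-time download of the NLTK data used in this article
nltk.download("punkt")      # sentence/word tokenizer models
nltk.download("stopwords")  # English stop-word list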
✅ Extract Text from Word Document
Use the docx2txt library to extract the text from the Word document:
import docx2txt
# Replace 'document.docx' with your file path
text = docx2txt.process("document.docx")
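If you prefer the python-docx package instead, a rough equivalent is to open the document and join the paragraph texts yourself. The snippet below is a sketch along those lines (it ignores tables, headers, and footers):
from docx import Document  # provided by the python-docx package

# Alternative: build the text from each paragraph in the document
doc = Document("document.docx")
text = "\n".join(paragraph.text for paragraph in doc.paragraphs)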
✅ Preprocess Text
Before summarization, it's a good idea to preprocess the extracted text. You can use the nltk library to tokenize the text into sentences and words:
from nltk.tokenize import sent_tokenize, word_tokenize
sentences = sent_tokenize(text)
words = word_tokenize(text)
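As a quick sanity check, you can print how many sentences and words were extracted; the exact numbers will depend on your document:
# Inspect the tokenization result
print(f"Sentences: {len(sentences)}")
print(f"Words: {len(words)}")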
✅ Text Summarization
To create a summary, you can use various techniques. One common approach is extractive summarization, where important sentences from the text are selected to form the summary. Combining nltk with the networkx library, you can implement a basic version of the TextRank algorithm for this purpose:
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
from nltk.tokenize import sent_tokenize, word_tokenize
import numpy as np
import networkx as nx

def sentence_similarity(sent1, sent2, stopwords=None):
    # Compare two sentences as bag-of-words vectors (1 - cosine distance)
    if stopwords is None:
        stopwords = []
    words1 = [w.lower() for w in word_tokenize(sent1)]
    words2 = [w.lower() for w in word_tokenize(sent2)]
    all_words = list(set(words1 + words2))
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)
    for w in words1:
        if w not in stopwords:
            vector1[all_words.index(w)] += 1
    for w in words2:
        if w not in stopwords:
            vector2[all_words.index(w)] += 1
    return 1 - cosine_distance(vector1, vector2)

def build_similarity_matrix(sentences, stop_words):
    # Pairwise similarity between every pair of sentences
    similarity_matrix = np.zeros((len(sentences), len(sentences)))
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i != j:
                similarity_matrix[i][j] = sentence_similarity(sentences[i], sentences[j], stop_words)
    return similarity_matrix

def generate_summary(text, top_n=5):
    # Rank sentences with PageRank over the similarity graph (TextRank)
    stop_words = stopwords.words('english')
    sentences = sent_tokenize(text)
    sentence_similarity_matrix = build_similarity_matrix(sentences, stop_words)
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph)
    ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
    summary = ""
    for i in range(top_n):
        summary += " " + ranked_sentences[i][1]
    return summary.strip()

# Generate a summary with the top 5 sentences
summary = generate_summary(text, top_n=5)
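One small robustness tweak: if top_n is larger than the number of sentences in the document, the loop above will raise an IndexError. A minimal guard, using the same function names, is sketched here:
# Clamp top_n so very short documents don't raise an IndexError
safe_top_n = min(5, len(sent_tokenize(text)))
summary = generate_summary(text, top_n=safe_top_n)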
The code above defines functions to score sentence similarity, build a similarity matrix, and rank sentences with PageRank, generating a summary with the TextRank algorithm.
Please note that this is a simplified example, and there are more advanced methods and libraries available for text summarization.
Additionally, the quality of the summary might vary depending on the complexity of the document and the chosen summarization technique.
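For example, an abstractive approach built on the Hugging Face transformers package (a separate install; the snippet below is a sketch that assumes the default summarization model is downloaded on first use) can produce more fluent summaries for shorter inputs:
from transformers import pipeline

# Abstractive summarization; the default model has an input-length limit,
# so very long documents should be chunked or truncated first
summarizer = pipeline("summarization")
result = summarizer(text[:3000], max_length=150, min_length=40, do_sample=False)
print(result[0]["summary_text"])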
Thank you!
Resources
- 👉 Generate Django Apps using Rocket Generator
- 👉 Join the Community and chat with the support team
In the realm of Python programming, document summarization acts as a shortcut to efficiency: a short script can extract the key insights from lengthy documents and present their essence concisely, helping you grasp the contents of large files quickly and boosting productivity.