Today, we’re automating the summarization of online articles using Python and the BART model from Hugging Face. Let’s dive right in!
Getting Started
Make sure you have the transformers and torch libraries installed, along with coloredlogs for logging and requests and beautifulsoup4 for the scraping step later on. You can install everything with pip:
pip install torch
pip install transformers
pip install coloredlogs
pip install requests beautifulsoup4
Initializing the Summarizer and Setting Up Our Logger
We begin by clearing our CUDA cache, then determining whether we can run our operations on a GPU or need to fall back to the CPU. The pipeline function helps us create our summarizer:
import torch
import logging
import coloredlogs
from transformers import pipeline
# Initialize the logger with colors
logger = logging.getLogger(__name__)
coloredlogs.install(level='INFO', logger=logger)
torch.cuda.empty_cache()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=device)
The summarizer model in the pipeline above is facebook/bart-large-cnn, which is a BART model.
BART stands for Bidirectional and Auto-Regressive Transformers. It’s a model developed by Facebook’s AI team. Unlike purely autoregressive models that read text in a single direction (left-to-right or right-to-left), BART’s encoder processes the input in both directions, which allows for a richer understanding of context.
The ‘auto-regressive’ part refers to the way BART generates output sequences: it predicts each subsequent token based on the ones it has already generated, in addition to the input.
BART has been found to perform excellently across a variety of tasks, including text generation, translation, and summarization. For our use case, we’re leveraging a variant of BART, bart-large-cnn, which has been fine-tuned on the CNN/Daily Mail dataset specifically for summarizing news articles.
This model has been made available through the Hugging Face transformers library, a popular repository of pre-trained models for natural language processing tasks.
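Before building the scraper, it’s worth sanity-checking the pipeline on a plain string. A minimal example (the sample text below is just a placeholder; substitute anything you like):
sample = (
    "The James Webb Space Telescope has captured new images of distant "
    "galaxies, giving astronomers an unprecedented look at the early universe. "
    "Researchers say the data could reshape theories of galaxy formation."
)
result = summarizer(sample, max_length=60, min_length=10, do_sample=False)
print(result[0]["summary_text"])
The pipeline returns a list with one dictionary per input, so result[0]["summary_text"] holds the generated summary.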
Building the Summarizer Function
Next, we’ll define a function called summarize_news(). This function takes in a URL for a news article and uses our summarizer to, well, summarize it!
def summarize_news(url, summarizer, max_length=130, min_length=30):
    article_text = scrape_website(url)
    if not article_text:
        logger.warning(f"No text scraped from url: {url}. Skipping summarization.")
        return None
    try:
        # Split the article text into chunks of roughly 1024 characters
        # (a crude proxy for the model's 1024-token input limit)
        chunks = [article_text[i:i + 1024] for i in range(0, len(article_text), 1024)]
        # Summarize each chunk separately
        summaries = []
        for chunk in chunks:
            logger.info("Summarizing chunk...")
            summary = summarizer(chunk, max_length=max_length, min_length=min_length, do_sample=False)[0]["summary_text"]
            summaries.append(summary)
        # Combine all chunk summaries into one summary
        combined_summary = ' '.join(summaries)
        return combined_summary
    except Exception as e:
        logger.error(f"Error occurred during summarization: {e}")
        return None
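One caveat: the chunking above counts characters, not tokens, so a chunk can still overflow the model’s 1024-token input limit or get cut mid-sentence. If you need more precision, here is a sketch of token-based chunking using the model’s own tokenizer (the chunk_by_tokens helper and the 1000-token margin are my additions, not part of the original script):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

def chunk_by_tokens(text, max_tokens=1000):
    # Hypothetical helper: encode once, slice the token ids,
    # then decode each slice back into a text chunk
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    return [
        tokenizer.decode(token_ids[i:i + max_tokens], skip_special_tokens=True)
        for i in range(0, len(token_ids), max_tokens)
    ]
You could then swap this into summarize_news() in place of the character-based list comprehension.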
The Scraper Function
To fetch the content from the website, we’ve got another function, scrape_website(). It sends a GET request to the provided URL, and if the response is a success (HTTP status 200), it uses BeautifulSoup to parse the HTML.
import re
import requests
from bs4 import BeautifulSoup

def scrape_website(url):
    logger.info(f"Scraping {url}...")
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        relevant_text = []
        # Collect text-bearing elements, skipping very short or purely numeric snippets
        elements = soup.find_all(['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'strong', 'em'])
        for element in elements:
            text = element.get_text().strip()
            if len(text) > 20 and not re.search(r'^\d+\s*$', text):
                relevant_text.append(text)
        if relevant_text:
            article_text = ' '.join(relevant_text)
            return article_text
        else:
            logger.error("Failed to find relevant text.")
            return None
    elif response.status_code == 404:
        logger.error(f"Website not found: {url}")
        return None
    else:
        logger.error(f"Failed to fetch website. Status code: {response.status_code}")
        return None
Adjustments may be needed to the script above depending on the site you intend to scrape.
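For example, many sites wrap the article body in a dedicated container, and narrowing the search to it cuts out navigation and footer noise. A sketch, assuming a hypothetical site whose article lives in a div with the class article-body (inspect your target page to find the real selector):
def scrape_article_body(url):
    response = requests.get(url, timeout=10)
    if response.status_code != 200:
        return None
    soup = BeautifulSoup(response.content, 'html.parser')
    # 'article-body' is a placeholder class name; adjust it for your target site
    container = soup.find('div', class_='article-body')
    if container is None:
        return None
    paragraphs = [p.get_text().strip() for p in container.find_all('p')]
    return ' '.join(p for p in paragraphs if len(p) > 20)
With everything in place, pass any article URL to summarize_news():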
url = "https://www.example.com/news/article"
print(summarize_news(url, summarizer))
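To process several articles in one run (the URLs below are placeholders):
# Placeholder URLs; replace with real article links
urls = [
    "https://www.example.com/news/article-1",
    "https://www.example.com/news/article-2",
]
for article_url in urls:
    summary = summarize_news(article_url, summarizer)
    if summary:
        print(f"{article_url}\n{summary}\n")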
And there you have it! An automatic article summarizer built with Python. Feel free to tweak and customize it according to your needs. Happy coding!
Drowning in data but not sure how to make sense of it all? You’re not alone, and I am here to help! At Epoch Insights, I turn your data into actionable insights that drive decision-making. Don’t wait for tomorrow to unlock the power of your data. Start your journey towards data enlightenment by booking a consultation! I can’t wait to help you thrive in this data-driven world.