Media Cloud

Open Source

Media Cloud is an Open Source Project


Media Cloud is a suite of technologies that allows researchers to answer quantitative questions about the content of online media. As an academic research project, Media Cloud is fully committed to being open source. This means all our software is written "out loud": in public, for you to view, engage with, and contribute to. The source code for our core engine, web-based research support platforms, and many connected libraries is all on GitHub.


The Core Engine

Our core engine collects content and provides web-based tools for doing research on it. People have spun up their own installations of Media Cloud for their own research, but it is far easier to add your content and requirements to our main hosted installation, where others can benefit as well.


Media Cloud

Our core application is a pipeline that collects stories from across the web, processes them, stores them, and makes them available via an API.  This is a large amount of Perl and Python code, connected to Postgres and Solr databases.

Online Web Applications

Our web-based research support tools pull data via the API and provide reports, visualizations, and searching in a variety of ways. They are written in JavaScript and Python, using React, Redux, and Flask.


Associated Utilities

As we've developed our core engine, a number of smaller projects have spun off as utilities that others can use, with or without Media Cloud. We've published those back to the community.



CLIFF-CLAVIN

We do entity extraction and geoparsing via our CLIFF-CLAVIN tool. We built it to identify and disambiguate references to places in news articles. It is written in Java and builds on top of the CLAVIN project.

Media Cloud API Client

Researchers who want lower-level access to the data Media Cloud provides can use our Python API client library. This is the library all of our online web applications use, and what we use internally to drive research in Jupyter notebooks.

Feed Seeker

The main way Media Cloud ingests stories is by fetching RSS feeds. For each source we track a list of feeds to pull stories from. Feed Seeker is a Python library for discovering the RSS, Atom, XML, and RDF feeds that might be associated with an arbitrary web URL.
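One common way a page advertises its feeds is through `<link rel="alternate">` tags in its HTML head. The sketch below illustrates that single idea using only the Python standard library. It is not Feed Seeker's code or API; the names `FeedLinkParser`, `find_feed_links`, and the `FEED_TYPES` list are hypothetical, chosen for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types that commonly advertise a feed (an illustrative list,
# assumed for this sketch, not taken from Feed Seeker).
FEED_TYPES = {
    "application/rss+xml",
    "application/atom+xml",
    "application/rdf+xml",
}


class FeedLinkParser(HTMLParser):
    """Collects href values from <link rel="alternate"> tags that declare a feed type."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        href = attr.get("href")
        if "alternate" in rel and mime in FEED_TYPES and href:
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, href))


def find_feed_links(html, base_url):
    """Return absolute URLs of feeds advertised in the given HTML text."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```

A caller would fetch the page itself (for example with `urllib.request`) and pass the decoded HTML and the page URL to `find_feed_links`. A real discovery tool would likely also need to check candidate links in the page body and verify that each candidate actually parses as a feed.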

Date Guesser

Determining the date of content published on the web is a hard problem. Date Guesser is a Python library that extracts a publication date from a web page, along with a measure of that date's accuracy.
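To give a sense of what "a date plus a measure of accuracy" can mean, here is a minimal sketch that looks only at date-like path segments in a URL. It is an assumption-laden illustration, not Date Guesser's method or API: the function name `guess_date_from_url` and the accuracy labels `"day"`, `"month"`, and `"none"` are hypothetical.

```python
import re
from datetime import date

# /YYYY/MM/DD/ gives day-level accuracy; /YYYY/MM/ only month-level.
DAY_PATTERN = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")
MONTH_PATTERN = re.compile(r"/(\d{4})/(\d{1,2})(?:/|$)")


def guess_date_from_url(url):
    """Return (date, accuracy) guessed from the URL path, or (None, "none")."""
    match = DAY_PATTERN.search(url)
    if match:
        try:
            year, month, day = (int(g) for g in match.groups())
            return date(year, month, day), "day"
        except ValueError:
            pass  # digits matched but do not form a valid calendar date
    match = MONTH_PATTERN.search(url)
    if match:
        try:
            year, month = (int(g) for g in match.groups())
            # The day is unknown, so the first of the month is a placeholder.
            return date(year, month, 1), "month"
        except ValueError:
            pass
    return None, "none"
```

For a URL such as `https://example.com/2017/10/13/some-story` this returns a date of 13 October 2017 with accuracy `"day"`. A URL is only one signal; page metadata and visible text would also need to be considered in practice.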


Hausa Stemmer

Media Cloud supports many languages. This Python library lets us stem content in the Hausa language. It is a reference implementation of the stemmer described by Bimba et al. (2015).

Catalan Snowball Stemmer

Media Cloud supports many languages. This Perl library is an interface to the Snowball stemmer for the Catalan language.


Multilingual Sentence Splitter

Our system splits text content into sentences for analysis in multiple languages.  This is a Python port of the Lingua::Sentence Perl module.
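Sentence splitters in this family typically break on sentence-final punctuation while consulting a list of "non-breaking prefixes" (abbreviations such as "Dr") to avoid false splits. The sketch below shows that idea for English only. It is a simplified illustration, not the library's code or API; `split_sentences` and the tiny `NONBREAKING_PREFIXES` set are hypothetical.

```python
import re

# A tiny, illustrative set of abbreviations that should not end a sentence.
# Real splitters ship much larger, per-language lists.
NONBREAKING_PREFIXES = {"Mr", "Mrs", "Ms", "Dr", "Prof", "St", "vs"}

# Sentence-final punctuation, then whitespace, then a likely sentence start.
BOUNDARY = re.compile(r"[.!?]+\s+(?=[A-Z\"'(])")


def split_sentences(text):
    """Split English text into sentences using punctuation and a prefix list."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        words = text[start:match.start()].split()
        last_word = words[-1] if words else ""
        # Skip boundaries that follow a known abbreviation, e.g. "Dr. Smith".
        if text[match.start()] == "." and last_word in NONBREAKING_PREFIXES:
            continue
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

Calling `split_sentences("Dr. Smith arrived. He sat down! Why?")` keeps "Dr. Smith arrived." together as one sentence. Supporting multiple languages would mean swapping in a prefix list and boundary rules for each language.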

NYT News Labeler

We run all our English stories through a set of trained models to detect which theme(s) they focus on. To build these models, we took a transfer learning approach: starting with the Google News word2vec models and then adapting them to produce theme labels based on the New York Times Annotated Corpus.