Automated Web Scraping Using Flickr API With Dynamic Dashboards For Analysis

Tianhao Wu
12 min read · Mar 14, 2019


Introduction

When it comes to photography, people keep asking which cameras and lenses are worth buying and how to take good pictures. These questions have no definitive answers. Particular cameras are designed for particular purposes, and the quality of a photo depends not only on the resolution of the camera or the optical quality of the lens, but also on the scene, which expresses the emotion of whoever is behind the camera.

In this blog, we will explore these questions using data.

The code used in this blog can be found here:

https://github.com/wutianhao910/automated_web_scraping

1. Project objective

The project focuses on building a web scraper that collects photographs and the key elements behind them, such as topics, equipment used, camera settings, geographic locations, etc. The collected data will then be analyzed and visualized to give an intuition of which kinds of equipment suit given topics.

2. Introduction to web scraping

Web Scraping is the process of extracting information and data from a website, transforming the information on a webpage into structured data for further analysis. Web scraping is also known as web harvesting or web data extraction. With the overwhelming data available on the internet, web scraping has become an essential approach to aggregating Big Data sets.

Web pages are built using text-based markup languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end-users and not for ease of automated use. Because of this, toolkits that scrape web content were created. A web scraper is a program (often built on top of an Application Programming Interface, or API) that extracts data from a website.

Scraping a web page involves fetching it and extracting data from it. Fetching is the downloading of a page; web crawling, the fetching of pages for later processing, is therefore one of the main components of web scraping. Once a page is fetched, extraction can take place: the content may be parsed, searched, reformatted, or copied into a spreadsheet. Web scrapers typically take something out of a page in order to use it for another purpose somewhere else.

3. Website

The website we are going to explore is Flickr. It is an image hosting service and video hosting service. The service is widely used by photo researchers and by bloggers to host images that they embed in blogs and social media. As one of the most professional and authoritative websites in the photography field, the data on the website can reflect the digital camera market relatively accurately.

4. Project workflow

Website Exploration

The home page of Flickr is https://www.flickr.com.

From the home page, we can search for certain tags (or topics). First, we randomly type in a tag and see what happens.

We see the pictures of Toronto here. The URL changes from https://www.flickr.com to https://www.flickr.com/search/?text=Toronto. By clicking on one of the pictures, we can see the details of it.

Fortunately, all the data we need about a picture are on the same page. Now that we have located the target page, we need to decide which tags we are going to crawl. A tricky thing is that Flickr users can create tags however they like, which causes the problem of certain tags having only a limited number of pictures.

To avoid bad tags that cannot provide enough data for further use, an effective approach is to use the most popular tags. These can be found in the “Explore” section under the “Trending” tab.

We can find the most popular tags on this page, but this ranking is dynamic, so the tags change over time. Instead of taking the tags directly, we can use a simple crawler to collect them into a tag list.

We take “sunset” from the tag list and search for it. The page is redirected to the photo stream page of that tag, for instance, https://www.flickr.com/search/?text=sunset.

One thing to notice is that this stream page is dynamically loaded: it keeps loading more pictures as we scroll down. To tackle this, we could use a package such as Selenium to simulate the scrolling action in a browser and load as many pictures as possible.

However, the page's markup is obfuscated to discourage crawling: the photo URLs cannot be obtained by the crawler even though we can see them when inspecting the page's source code. Fortunately, Flickr provides an official API that gives developers access to almost everything on the website, so we will use this API to crawl the pictures under each tag.

Data Scraping

1. Tags

The tags can be scraped from the page that lists the most popular tags. Because this page is statically loaded, we can use BeautifulSoup to build a crawler that obtains the top ten tags. The tag texts are under the class “overlay”.
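As a minimal sketch of this step (parsing a static HTML fragment shaped like the trending page, since the live markup and class names may differ), the extraction could look like:

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

def top_tags(page_html, limit=10):
    """Return the text of elements with class 'overlay', up to `limit` tags."""
    soup = BeautifulSoup(page_html, "html.parser")
    return [el.get_text(strip=True) for el in soup.find_all(class_="overlay")][:limit]

# A stand-in fragment; the real page would be fetched with requests.get() first.
sample = (
    '<div class="overlay">sunset</div>'
    '<div class="overlay">landscape</div>'
    '<div class="overlay">portrait</div>'
)
print(top_tags(sample))  # ['sunset', 'landscape', 'portrait']
```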

2. Photo IDs

Now that we have the tags, we can start to scrape the data from pictures by searching each tag. According to the Flickr API documentation, a photo ID is required to use its built-in functions, so the priority is to get photo IDs. To walk through all the photos under a given tag, we use the FlickrAPI.walk() function. First, we test it on the Flickr API website with the tag “sunset”.

Notice that the tag “sunset” has over 4.3 million results. Crawling this tag could take days or even months due to the website's rate limits. Besides, the results contain not only photos but also irrelevant items such as screenshots and videos. We therefore set four parameters:

Then we take the photo IDs and views out and store them in a .csv file.
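The walk-and-store step could be sketched as follows. FlickrAPI.walk() is provided by the flickrapi package, but the exact parameter values and the helper names here are assumptions about the original setup:

```python
import csv

def write_photo_rows(path, rows):
    """Store (photo_id, views) pairs in a .csv file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "views"])
        writer.writerows(rows)

def fetch_photo_rows(api_key, api_secret, tag, max_photos=5000):
    """Yield (photo_id, views) for photos under `tag` via FlickrAPI.walk()."""
    import flickrapi  # third-party: pip install flickrapi
    flickr = flickrapi.FlickrAPI(api_key, api_secret)
    walker = flickr.walk(
        tags=tag,
        tag_mode="all",   # match the tag exactly
        media="photos",   # drop videos and other media
        extras="views",   # ask for each photo's view count
        per_page=500,     # fewer requests for the same number of photos
    )
    for i, photo in enumerate(walker):
        if i >= max_photos:   # cap the crawl to respect rate limits
            break
        yield photo.get("id"), int(photo.get("views", 0))

# Usage (requires real API credentials):
# write_photo_rows("sunset.csv", fetch_photo_rows(KEY, SECRET, "sunset"))
```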

Somehow, a picture can show up more than once in the returned results, with either the same or a different number of views. One possible reason is that the picture was shared elsewhere or collected in galleries. Thus, results with the same ID but different view counts should be summed, whereas results with the same ID and the same view count should be treated as redundant.
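This aggregation rule can be expressed in pandas (column names are illustrative):

```python
import pandas as pd

def aggregate_views(df):
    """Drop exact (id, views) duplicates, then sum the remaining views per id."""
    deduped = df.drop_duplicates(subset=["id", "views"])
    return deduped.groupby("id", as_index=False)["views"].sum()

raw = pd.DataFrame({
    "id":    ["a", "a", "a", "b"],
    "views": [100, 100, 50, 30],   # the second 'a'/100 row is pure redundancy
})
print(aggregate_views(raw))  # id 'a' -> 150 (100 + 50), id 'b' -> 30
```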

3. EXIFs and Geolocations

After getting photo IDs for the top ten tags, we can crawl the detailed camera settings and the geolocations using two other functions in the Flickr API.

(1). EXIFs

EXIF (Exchangeable Image File Format) is a standard that specifies formats for images produced by digital cameras (including smartphones). In addition to the image itself, an EXIF record stores the detailed camera settings used for that image.

The data we need to extract are the camera and lens models as well as the basic camera settings:

a) Camera make

b) Camera model

c) Lens Model

d) Shutter speed

e) Aperture

f) ISO speed

g) Focal Length

h) Color Space

i) Exposure program

j) Metering Mode

k) Flash

To extract the data above from an EXIF file, we use the FlickrAPI.photos.getExif() function and store the data in a separate .csv file.
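A hedged sketch of this step: pick_exif_fields() reduces the API response to the fields listed above, while the exact label strings and the response-parsing details are assumptions about how Flickr names them.

```python
# The fields listed in the section above; the exact labels Flickr uses
# may differ slightly from these assumed strings.
WANTED = ["Make", "Model", "Lens Model", "Shutter Speed", "Aperture",
          "ISO Speed", "Focal Length", "Color Space", "Exposure Program",
          "Metering Mode", "Flash"]

def pick_exif_fields(entries):
    """Map EXIF label -> raw value for the wanted fields.

    `entries` is a list of (label, value) pairs, mirroring the
    <exif label="..."><raw>...</raw></exif> elements the API returns."""
    found = dict(entries)
    return {label: found.get(label) for label in WANTED}

def fetch_exif(api_key, api_secret, photo_id):
    """Call FlickrAPI.photos.getExif() and reduce it to the wanted fields."""
    import flickrapi  # third-party: pip install flickrapi
    flickr = flickrapi.FlickrAPI(api_key, api_secret)
    resp = flickr.photos.getExif(photo_id=photo_id)
    entries = [(e.get("label"), e.find("raw").text) for e in resp.iter("exif")]
    return pick_exif_fields(entries)

# Offline demo with made-up values:
print(pick_exif_fields([("Make", "Canon"), ("Model", "EOS 5D Mark III")]))
```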

(2). Geolocations

Geolocation is the physical place where a photo was taken, recorded as longitude and latitude. With these geographic coordinates, we can plot the data on a map and apply a geographical analysis to see where the best places are for shooting a certain topic.

To obtain the longitude and latitude of a photo, we use the FlickrAPI.photos.geo.getLocation() function and store the data in a separate .csv file.
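A sketch of the geolocation step; the XML shape below follows the API's documented response format, but the sample coordinates are made up:

```python
import xml.etree.ElementTree as ET

def parse_location(rsp):
    """Pull (latitude, longitude) out of a geo.getLocation response element."""
    loc = rsp.find(".//location")
    if loc is None:
        return None
    return float(loc.get("latitude")), float(loc.get("longitude"))

def fetch_geolocation(api_key, api_secret, photo_id):
    """Call FlickrAPI.photos.geo.getLocation() for one photo (sketch)."""
    import flickrapi  # third-party: pip install flickrapi
    flickr = flickrapi.FlickrAPI(api_key, api_secret)
    return parse_location(flickr.photos.geo.getLocation(photo_id=photo_id))

# Offline demo with a hand-built response (values are made up):
sample = ET.fromstring(
    '<rsp><photo id="123"><location latitude="43.65" longitude="-79.38"/>'
    '</photo></rsp>'
)
print(parse_location(sample))  # (43.65, -79.38)
```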

4. Merging together

Now we have all the data we planned to scrape, separated into three .csv files:

1) Tags.csv

2) EXIFs.csv

3) Geolocations.csv

The next step is to merge these files into one for analysis. An EXIF file is sometimes hidden from the public because it may contain the owner's private information, which results in missing values in the corresponding columns; the same applies to geolocations. However, the null values cannot simply be dropped, because a photo's EXIF file and geolocation are not necessarily missing at the same time. We still need all the EXIF data to analyze the digital camera market and all the coordinates to analyze the geographical distribution, so we will keep everything for now.
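A minimal pandas sketch of this merge (column names are illustrative); left joins on the photo ID keep each row even when its EXIF or location data is missing:

```python
import pandas as pd

def merge_all(tags_df, exifs_df, geos_df):
    """Left-join EXIFs and geolocations onto the tag/ID table.

    Left joins keep photos whose EXIF or location is hidden, so neither
    analysis loses rows to the other's missing values."""
    merged = tags_df.merge(exifs_df, on="id", how="left")
    return merged.merge(geos_df, on="id", how="left")

# Toy stand-ins for Tags.csv, EXIFs.csv, and Geolocations.csv:
tags = pd.DataFrame({"id": [1, 2, 3], "tag": ["sunset"] * 3, "views": [9, 5, 7]})
exifs = pd.DataFrame({"id": [1, 3], "model": ["EOS 5D Mark III", "D850"]})
geos = pd.DataFrame({"id": [2, 3], "lat": [51.5, 43.7], "lon": [-0.1, -79.4]})

df = merge_all(tags, exifs, geos)
print(df)  # photo 1 lacks coordinates, photo 2 lacks EXIF; both rows remain
```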

Data Analysis

The key elements for taking good pictures are equipment and topics (or tags, in this case). The top concerns for equipment are cameras and lenses, whereas for topics they are locations. Therefore, the analysis will focus on two aspects:

1) Tag — Camera — Lens:

To analyze which cameras and lenses are used frequently regarding each tag.

2) Tag — Geolocation:

To analyze which locations are the hottest for certain tags.

Instead of the number of pictures, we use the number of views as the metric. The reason is that we want to know what makes good pictures, not what makes the most pictures.
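A toy example (with made-up numbers) of why the two metrics can disagree:

```python
import pandas as pd

photos = pd.DataFrame({
    "brand": ["CANON", "CANON", "NIKON", "NIKON", "NIKON"],
    "views": [900, 600, 100, 100, 100],
})

# Ranking by picture count vs. ranking by total views:
by_count = photos["brand"].value_counts()                                  # NIKON first
by_views = photos.groupby("brand")["views"].sum().sort_values(ascending=False)  # CANON first

print(by_count.index[0])  # 'NIKON' has the most pictures...
print(by_views.index[0])  # ...but 'CANON' draws the most views
```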

1. Tag — Camera — Lens

(1). Brands ranking

The first step is to explore the overall popularity of camera manufacturers. Consistent with real-life experience, the top three brands are CANON, NIKON, and SONY.

Figure 1.1 — Brands Ranking

(2) Models ranking

Then we take the top three brands and plot the popularity of their cameras and lenses. We find that CANON has a few outstanding models that are used far more widely than NIKON's, while the top lenses are mostly CANON's “EF” lenses.

(a). Camera models

Figure 1.2 — Camera Models

(b). Lens models

Figure 1.3 — Lens Models

From the overall analysis, it is fair to say that CANON is the most popular manufacturer in the DSLR camera market. The most popular camera model is the EOS 5D MARK III, and the most popular lens model is the EF 24–105mm f/4L IS USM.

(3). Tags proportion

The following pie chart shows the proportion of total views under each tag. Notice that the top tag on the Flickr website is not the same as in the chart. This is because the website ranks by the number of pictures, while the chart compares the number of views.

Figure 1.4 — Tags Proportion

(4). Joint analysis

So far, we have analyzed the data by brand, model, and tag separately. The next step is to explore camera and lens usage for a given tag, because different topics call for different cameras and lenses.

Figure 1.5 — Equipment Dashboard — Visualized using Tableau

The figure above shows a dashboard built in Tableau. With some actions attached, we can select certain tags and brands to discover the cameras and lenses that serve the shooting purpose. As an illustration, we will select “sunset” from the tags chart and “CANON” from the brands chart.

(a). Select a tag

Figure 1.6 — Equipment Dashboard (tag selected) — Visualized using Tableau

(b). Select a brand

Figure 1.7 — Equipment Dashboard (brand selected) — Visualized using Tableau

2. Tag — Geolocation

(1). Geographical proportion

Now that we have discovered the correlation between tags and equipment, the next step is to analyze the geographical distribution of each tag. First, we change the previous pie chart to show the proportion of total views under each country instead of under each tag.

Figure 2.1 — Geographical Proportion

It can be seen that more than half of the total views come from pictures taken in Germany, the UK, the US, and Italy, at 17.23%, 16.57%, 13.25%, and 9.24%, respectively.

(2). Map

Then we put all the pictures we have on a map to have an intuitive understanding of the distribution at the geographical level.

Figure 2.2 — Pictures Map

(3). Joint analysis

After plotting the pie chart and the map, a joint dashboard can be created to explore the correlation between tags and geolocations.

Figure 2.3 — Geolocation Dashboard — Visualized using Tableau

The dashboard shown above is also built in Tableau. It allows us to select certain countries and tags to discover the geographical distribution of each tag in each country. As an illustration, we will select “United Kingdom” from the countries to see the distribution of all tags, then select “sunset” from the tags to see the difference.

(a). Select a country

Figure 2.4 — Geolocation Dashboard (country selected) — Visualized using Tableau

(b). Select a tag

Figure 2.5 — Geolocation Dashboard (tag selected) — Visualized using Tableau

(c). Select a picture

Note that every cross mark on the map represents an actual picture. If we select a picture from the map, we can find the camera settings and other details of the picture.

Figure 2.6 — Geolocation Dashboard (picture detail) — Visualized using Tableau

Summary

In this project, we built a web scraper for collecting data from Flickr. The scraper was written in Python, using BeautifulSoup to obtain the hottest tags (or topics) and the Flickr API to crawl each picture's EXIF data (detailed camera settings) and geolocation.

Next, we visualized the collected data in Tableau and built a few dashboards. The data were analyzed from two aspects: equipment usage and geographical distribution. We then drew the following conclusions:

1. Western countries seem like good places to take good pictures.

2. Top 10 topics on Flickr:

3. Top three brands on Flickr are CANON, NIKON, and SONY.

4. The top three camera models are all CANON products, which are far more popular than NIKON's cameras:

a) CANON EOS 5D MARK III

b) CANON EOS 6D

c) CANON EOS 5D MARK IV

5. Top three of NIKON’s camera models are:

a) NIKON D5300

b) NIKON D850

c) NIKON D810

6. The most commonly used lenses are mostly from CANON. The top three lenses are:

a) EF 24–105mm f/4L IS USM

b) EF 16–35mm f/4L IS USM

c) EF 24–70mm f/2.8L II USM
