# Work with internet resources

Beyond communication, people use the internet to create and consume content. There are a few ways to obtain that content programmatically.

## Web scraping (HTML parsing)

Web scraping covers [many techniques](https://en.wikipedia.org/wiki/Web_scraping#Techniques). This section focuses solely on HTML (Hypertext Markup Language) parsing, which automates what a person would otherwise do manually to extract information from a website.

HTML parsing builds on an understanding of the markup's structure and semantics. Regardless of how complex and dynamic the processes behind a website (or web app) are, the content is ultimately delivered as HTML, plus CSS (Cascading Style Sheets) for styling and, usually, JavaScript for interactivity.
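As a rough illustration of what "parsing HTML" means, here is a minimal sketch that uses only Python's standard library `html.parser` module to pull link targets and link text out of a small, made-up HTML fragment. The `LinkCollector` class and the `sample_html` string are assumptions for illustration only; they are not part of the spider used in this section.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href="..."> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently being read, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None


# made-up fragment that loosely mimics a table of universities
sample_html = '''
<table class="wikitable">
  <tr><td><a href="/wiki/University_of_Alberta">University of Alberta</a></td><td>Edmonton</td></tr>
  <tr><td><a href="/wiki/University_of_Calgary">University of Calgary</a></td><td>Calgary</td></tr>
</table>
'''

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```

A dedicated framework hides this kind of bookkeeping behind CSS and XPath selectors, which is the main reason to reach for one.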

In Python, we can leverage the open-source framework [Scrapy](https://scrapy.org/) to crawl and scrape data from websites.

### A Canadian University Spider

In [1]:
import json

from scrapy import Spider


# our first "Spider" (that crawls the designated website for us)
class UniversitySpider(Spider):

    name = 'University Spider'
    start_urls = ['https://en.wikipedia.org/wiki/List_of_universities_in_Canada']
    
    custom_settings = {
        'ITEM_PIPELINES': { 'item_pipeline.ItemPipeline': 300 },  # from item_pipeline.py
        'LOG_LEVEL': 'ERROR',
    }

    def parse(self, response):
        rows = response.css('table.wikitable > tbody > tr')

        for row in rows:
            school = row.xpath('td[1]')

            if school.css('a ::text'):
                yield response.follow(school.css('a')[0], self.school_parser)

    def school_parser(self, response):
        school_info = {}
        school_info['name'] = response.css('h1.firstHeading ::text').get()

        school_info['lat'] = response.css('span.latitude ::text').get()
        school_info['lng'] = response.css('span.longitude ::text').get()

        rows = response.css('table.infobox > tbody > tr')
        # loosely collect every infobox row that has a header cell
        for row in rows:
            header = row.css('th ::text').get()
            if header:
                school_info[header] = row.css('td ::text').get()

        yield school_info

To make a scraping script, we write a `class` that extends the `scrapy.Spider` base class. The base class abstracts away the underlying process so we can focus on specifics such as:
* The starting website URLs for the "Spider" to crawl.
* Rules, based on CSS and XPath selectors over the HTML, to:
    * Find the next-level links to follow.
    * Parse and pick out the essential information we want to collect.
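The spider's `custom_settings` point at an `item_pipeline.ItemPipeline` class whose source is not shown here. The following is only a guess at its shape: a minimal sketch, assuming the pipeline buffers every scraped item and writes them all to `universities.json` (the file loaded later with pandas). The method names `open_spider`, `process_item`, and `close_spider` are the hooks Scrapy calls on a pipeline; the body and the `output_path` attribute are assumptions.

```python
import json


class ItemPipeline:
    """Hypothetical pipeline: buffer scraped items, then dump them as JSON."""

    # assumed output location, inferred from the pd.read_json() call later on
    output_path = 'universities.json'

    def open_spider(self, spider):
        # called once when the spider starts
        self.items = []

    def process_item(self, item, spider):
        # called for every item the spider yields
        self.items.append(dict(item))
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        # called once when the spider finishes
        with open(self.output_path, 'w', encoding='utf-8') as f:
            json.dump(self.items, f, ensure_ascii=False)
```

Buffering in memory is fine for a crawl of this size; a larger crawl would more likely write items incrementally.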

Besides extending a base class, another new concept is `yield`, which involves the Python generator mechanism. A generator allows a function (or method) to behave like an iterator, which we can think of as an efficient way of stepping through something like a `list` one element at a time. You can read more about it on its [Python Wiki entry](https://wiki.python.org/moin/Generators). In short, `yield` behaves much like `return`, except that the function pauses instead of ending and resumes from the same point when the next value is requested, until the iterative or concurrent logic surrounding it exhausts all possible inputs.
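A small, self-contained sketch of the generator behaviour described above. The `countdown` function is a made-up example and has nothing to do with Scrapy itself.

```python
def countdown(n):
    """Yield n, n - 1, ..., 1, one value at a time."""
    while n > 0:
        yield n  # pause here and hand one value to the caller
        n -= 1   # execution resumes here when the next value is requested


gen = countdown(3)
first = next(gen)  # runs the body up to the first yield
rest = list(gen)   # keeps resuming until the generator is exhausted
print(first, rest)
```

In the spider, Scrapy plays the role of the caller: it keeps requesting values from `parse` and `school_parser` and handles each yielded request or item as it arrives.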

In [2]:
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(UniversitySpider)
process.start()

2021-03-05 12:16:36 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: scrapybot)
2021-03-05 12:16:36 [scrapy.utils.log] INFO: Versions: lxml 4.6.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.8.2 (default, May  5 2020, 15:52:07) - [Clang 11.0.0 (clang-1100.0.33.17)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1j  16 Feb 2021), cryptography 3.4.6, Platform macOS-10.16-x86_64-i386-64bit
2021-03-05 12:16:36 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-03-05 12:16:36 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 'ERROR'}


In [3]:
import pandas as pd

# load university data into a Pandas DataFrame
df = pd.read_json('./universities.json')
df

Unnamed: 0,name,lat,lng,Former names,Type,Established,President,Academic staff,Administrative staff,Students,...,Tenants,Academic affiliation,Commandant,Call signs,Athletic teams,Principal and Vice-Chancellor,Tag line,Vice-president,Public transit,Faculty
0,Alberta University of the Arts,51°03′43″N,114°05′29″W,\n,Public,1926,Dr. Daniel Doz,145,95,1323,...,,,,,,,,,,
1,University of Victoria,48°27′48″N,123°18′42″W,Victoria College,Public university,"July 1, 1963",Kevin Hall,914 faculty,"5,251 employees",21696,...,,,,,,,,,,
2,University College of the North,53°49′11″N,101°14′16″W,Keewatin Community College (1966-2004),University college,"July 1, 2004 as University College of the North",Doug Lauvstad,,Approximately 400,"Approximately 2,400",...,,,,,,,,,,
3,University of Winnipeg,49°53′24.44″N,97°9′12.12″W,,Public,"1871 Manitoba College. Subsequent names, Wesle...",Annette Trimbee,305,494,9419,...,,,,,,,,,,
4,University of Northern British Columbia,53°53′14.40″N,122°48′49.40″W,,Public university,1990,Geoffrey Payne (Interim),,,3570 (2019/2020),...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88,MacEwan University,53°32′49″N,113°30′17″W,"Grant MacEwan University, Grant MacEwan Colleg...",Public University,1971,Annette Trimbee,972,,19101,...,,,,,,,,,,
89,University of Lethbridge,49°40′00″N,112°51′50″W,,Public,1967,Michael J. Mahon,491,,,...,,,,,,,,,,
90,University of Calgary,51°04′39″N,114°07′59″W,,Public,26 April 1966,Ed McCauley,"1,848\n",3116,31950,...,,,,,,,,,,
91,University of Alberta,53°31′28″N,113°31′28″W,,Public,1908,Bill Flanagan,2764,2527,,...,,,,,,,,,,


In [4]:
# convert DMS (Degrees-Minutes-Seconds) format to pure numerical decimal point format
# decimal = (degrees + minutes / 60 + seconds / (60 * 60)) * (-1 if S or W else 1)
def decimal_coord(dms):
    try:
        degs, parts = dms.split('°')
        mins, parts = parts.split('′')
        try:
            secs, sign = parts.split('″')
        except ValueError:
            # no seconds component, e.g. 49°40′N
            secs, sign = 0, parts
        sign = -1 if sign in ['S', 'W'] else 1
        return (float(degs) + float(mins) / 60 + float(secs) / (60 * 60)) * sign
    except (AttributeError, ValueError):
        # missing or unparseable coordinate
        return 0


# apply the mutation to the DataFrame
df['lat'] = df['lat'].map(decimal_coord)
df['lng'] = df['lng'].map(decimal_coord)
df

Unnamed: 0,name,lat,lng,Former names,Type,Established,President,Academic staff,Administrative staff,Students,...,Tenants,Academic affiliation,Commandant,Call signs,Athletic teams,Principal and Vice-Chancellor,Tag line,Vice-president,Public transit,Faculty
0,Alberta University of the Arts,51.061944,-114.091389,\n,Public,1926,Dr. Daniel Doz,145,95,1323,...,,,,,,,,,,
1,University of Victoria,48.463333,-123.311667,Victoria College,Public university,"July 1, 1963",Kevin Hall,914 faculty,"5,251 employees",21696,...,,,,,,,,,,
2,University College of the North,53.819722,-101.237778,Keewatin Community College (1966-2004),University college,"July 1, 2004 as University College of the North",Doug Lauvstad,,Approximately 400,"Approximately 2,400",...,,,,,,,,,,
3,University of Winnipeg,49.890122,-97.153367,,Public,"1871 Manitoba College. Subsequent names, Wesle...",Annette Trimbee,305,494,9419,...,,,,,,,,,,
4,University of Northern British Columbia,53.887333,-122.813722,,Public university,1990,Geoffrey Payne (Interim),,,3570 (2019/2020),...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88,MacEwan University,53.546944,-113.504722,"Grant MacEwan University, Grant MacEwan Colleg...",Public University,1971,Annette Trimbee,972,,19101,...,,,,,,,,,,
89,University of Lethbridge,49.666667,-112.863889,,Public,1967,Michael J. Mahon,491,,,...,,,,,,,,,,
90,University of Calgary,51.077500,-114.133056,,Public,26 April 1966,Ed McCauley,"1,848\n",3116,31950,...,,,,,,,,,,
91,University of Alberta,53.524444,-113.524444,,Public,1908,Bill Flanagan,2764,2527,,...,,,,,,,,,,
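The conversion can be checked by hand against the first row: 51°03′43″N is 51 + 3/60 + 43/3600 ≈ 51.061944, and 114°05′29″W is -(114 + 5/60 + 29/3600) ≈ -114.091389. The sketch below is an alternative, regex-based take on the same formula, not the notebook's own function. The names `DMS_PATTERN` and `dms_to_decimal` are made up for illustration, and unlike `decimal_coord` it returns `None` for input it cannot parse.

```python
import re

# degrees°minutes′[seconds″]hemisphere, e.g. 51°03′43″N or 49°40′N
DMS_PATTERN = re.compile(
    r'(?P<deg>\d+(?:\.\d+)?)°'
    r'(?P<min>\d+(?:\.\d+)?)′'
    r'(?:(?P<sec>\d+(?:\.\d+)?)″)?'
    r'(?P<hemi>[NSEW])'
)


def dms_to_decimal(dms):
    """Convert a DMS string to signed decimal degrees, or None if unparseable."""
    m = DMS_PATTERN.fullmatch(dms.strip())
    if m is None:
        return None
    seconds = float(m.group('sec') or 0)  # seconds are optional
    value = float(m.group('deg')) + float(m.group('min')) / 60 + seconds / 3600
    # south and west are negative by convention
    return -value if m.group('hemi') in 'SW' else value


print(round(dms_to_decimal('51°03′43″N'), 6))
print(round(dms_to_decimal('114°05′29″W'), 6))
print(dms_to_decimal('not a coordinate'))
```

Returning `None` rather than `0` keeps unparseable rows from being plotted at latitude 0, longitude 0, at the cost of having to drop or fill those rows before mapping.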


In [5]:
# plot it on a map visualization
import folium

m = folium.Map()

def plot(map):

    def fn(row):
        return folium.CircleMarker(
            location=[row['lat'], row['lng']],
            radius=10,
            fill_color='blue',
            popup=row['name'],
        ).add_to(map)

    return fn

# functionally emulate what can be done iteratively
df.apply(plot(m), axis=1)

m

## Web APIs

API stands for Application Programming Interface. Such interfaces expose the underlying abstractions of a computer system or software stack in a controlled manner.

Web APIs involve the abstractions made available through web servers. The most widely adopted protocol to expose web APIs is HTTP (Hypertext Transfer Protocol).

Where one is available, consuming a web API is a more efficient way to obtain data than scraping HTML. Instead of parsing output HTML intended for humans, we receive structured data (JSON in the example below) that a program can use directly.

We will use the [HH (household) Segments API](https://docs.eqworks.io) from our LOCUS product for our examples below.

In [6]:
import os
import requests

# obtain a valid EQ API JWT of your own
with open(os.path.expanduser('~/.locussdk/.token')) as f:
    JWT = f.read().strip()

headers = {'eq-api-jwt': JWT}

coords = [
    '42.9885,-81.2270',
    '45.4118,-75.7146',
    '43.9153,-78.8869',
    '46.5012,-81.0069',
    '43.5442,-79.6032',
    '43.7608,-79.5757',
    '43.8239,-79.0855',
    '43.7561,-79.4046',
    '43.7880,-79.4464',
]
params = {'coords[]': coords}

res = requests.get('https://api.locus.place/prod/segment/hh-segments', headers=headers, params=params)
res.raise_for_status()  # fail early on an HTTP error (e.g. an expired token)
data = res.json()
hh_df = pd.DataFrame(data)
hh_df

Unnamed: 0,lat,long,segments
0,42.9885,-81.227,{'school_students': 0.019199999999999995}
1,45.4118,-75.7146,"{'school_students': 0.03839999999999999, 'univ..."
2,43.9153,-78.8869,"{'school_students': 0.272, 'movie_goer': 0.328..."
3,46.5012,-81.0069,"{'university_student': 0.02400000000000002, 's..."
4,43.5442,-79.6032,{'school_students': 0.6320000000000001}
5,43.7608,-79.5757,{'movie_goer': 0.6272000000000001}
6,43.8239,-79.0855,"{'school_students': 0.85, 'movie_goer': 0}"
7,43.7561,-79.4046,"{'frequent_traveler': 0.10240000000000005, 'mo..."
8,43.788,-79.4464,{'frequent_traveler': 0.16000000000000003}
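In the request above, `params = {'coords[]': coords}` is sent as a repeated query-string parameter, one per coordinate pair. The sketch below uses the standard library's `urllib.parse.urlencode` to show roughly what that query string looks like. It only illustrates the encoding; `sample_coords` is a shortened stand-in for the full list, and the claim that the API expects this shape is inferred from the request code rather than from its documentation.

```python
from urllib.parse import parse_qs, urlencode

# shortened stand-in for the coords list used in the request
sample_coords = ['42.9885,-81.2270', '45.4118,-75.7146']

# doseq=True repeats the key once per list element
query = urlencode({'coords[]': sample_coords}, doseq=True)
print(query)

# decoding the query string recovers the original list
decoded = parse_qs(query)
print(decoded)
```

The brackets and commas are percent-encoded on the wire, which is why the raw URL looks noisier than the Python dictionary it came from.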


In [7]:
# pick out primary segment (segment with highest score)
def find_primary(segments):
    score = 0
    primary = None
    for k, v in segments.items():
        if v > score:
            primary = k
            score = v

    return primary

hh_df['primary'] = hh_df['segments'].map(find_primary)

# get primary segment score
hh_df['score'] = hh_df.apply(lambda row: row['segments'][row['primary']], axis=1)

hh_df

Unnamed: 0,lat,long,segments,primary,score
0,42.9885,-81.227,{'school_students': 0.019199999999999995},school_students,0.0192
1,45.4118,-75.7146,"{'school_students': 0.03839999999999999, 'univ...",university_student,0.0832
2,43.9153,-78.8869,"{'school_students': 0.272, 'movie_goer': 0.328...",movie_goer,0.328
3,46.5012,-81.0069,"{'university_student': 0.02400000000000002, 's...",school_students,0.432
4,43.5442,-79.6032,{'school_students': 0.6320000000000001},school_students,0.632
5,43.7608,-79.5757,{'movie_goer': 0.6272000000000001},movie_goer,0.6272
6,43.8239,-79.0855,"{'school_students': 0.85, 'movie_goer': 0}",school_students,0.85
7,43.7561,-79.4046,"{'frequent_traveler': 0.10240000000000005, 'mo...",frequent_traveler,0.1024
8,43.788,-79.4464,{'frequent_traveler': 0.16000000000000003},frequent_traveler,0.16


In [8]:
# map 'em up
hh_m = folium.Map(location=[43.651890, -79.381706], zoom_start=6)


def get_color(seg):
    # fall back to grey for any segment without a pre-determined color
    return {
        'school_students': 'blue',
        'university_student': 'red',
        'movie_goer': 'yellow',
        'frequent_traveler': 'purple',
    }.get(seg, 'grey')


def format_popup(row):
    s = f"<h4>{' '.join(row['primary'].split('_'))}</h4>"
    s += f"<p>Coord: {row['lat']}, {row['long']}</p>"
    if len(row['segments']) > 1:
        s += '<ul>'
        for k, v in row['segments'].items():
            s += f'<li>{k}: {v}</li>'
        s += '</ul>'
    return folium.Popup(s, max_width=500)


def plot(map):

    def fn(row):
        color = get_color(row['primary'])
        return folium.CircleMarker(
            location=[row['lat'], row['long']],
            color=color,  # use the pre-determined color
            fill_color=color,
            fill_opacity=row['score'], # scale opacity against primary segment score
            popup=format_popup(row),
        ).add_to(map)

    return fn

# functionally emulate what can be done iteratively
hh_df.apply(plot(hh_m), axis=1)

hh_m

## SDKs

Technology providers usually offer SDKs (software development kits) that abstract away generic application development tasks. SDKs typically bring an extra layer of convenience for users on top of the APIs they wrap.

An example is our [LOCUS SDK](https://eqworks.github.io/locussdk).

It starts by helping you obtain the token needed to access LOCUS APIs:

```
% locus login
LOCUS user: leo.li@eqworks.com
Login passcode sent to leo.li@eqworks.com through email
Login passcode received (*): 
leo.li@eqworks.com successfully logged in. Token persisted at ~/.locussdk/.token
```

It also wraps individual API calls. For instance, the previous example of consuming the HH segments API becomes:

In [9]:
from locussdk import get_hh_segments

data = get_hh_segments(coords)
hh_df = pd.DataFrame(data)
hh_df

Unnamed: 0,lat,long,segments
0,42.9885,-81.227,{'school_students': 0.019199999999999995}
1,45.4118,-75.7146,"{'school_students': 0.03839999999999999, 'univ..."
2,43.9153,-78.8869,"{'school_students': 0.272, 'movie_goer': 0.328..."
3,46.5012,-81.0069,"{'university_student': 0.02400000000000002, 's..."
4,43.5442,-79.6032,{'school_students': 0.6320000000000001}
5,43.7608,-79.5757,{'movie_goer': 0.6272000000000001}
6,43.8239,-79.0855,"{'school_students': 0.85, 'movie_goer': 0}"
7,43.7561,-79.4046,"{'frequent_traveler': 0.10240000000000005, 'mo..."
8,43.788,-79.4464,{'frequent_traveler': 0.16000000000000003}
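To make the "extra layer of convenience" concrete, here is a hypothetical sketch of how such a wrapper could be built from the pieces in the Web APIs example. It is not the actual `locussdk` implementation: `read_token`, `fetch_hh_segments`, and the injectable `http_get` parameter are made-up names, while the endpoint, header name, and token path are copied from the earlier cells.

```python
import os

# endpoint and token path as used in the Web APIs example
API_URL = 'https://api.locus.place/prod/segment/hh-segments'
TOKEN_PATH = '~/.locussdk/.token'


def read_token(path=TOKEN_PATH):
    """Read the persisted JWT from disk."""
    with open(os.path.expanduser(path)) as f:
        return f.read().strip()


def fetch_hh_segments(coords, token, http_get):
    """Hypothetical wrapper: hide the URL, headers, and params behind one call.

    http_get is any callable with the same shape as requests.get.
    """
    res = http_get(
        API_URL,
        headers={'eq-api-jwt': token},
        params={'coords[]': list(coords)},
    )
    res.raise_for_status()
    return res.json()
```

With `requests` installed, a call could look like `fetch_hh_segments(coords, read_token(), requests.get)`. Passing the HTTP function in as an argument is a design choice made here so the wrapper can be exercised without network access.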
