Async web-scraping

19 Aug 2018

So I was trying to do some async web scraping…

The code I ended up with:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

ids = ['1', '2', '3']

async def fetch(session, id):
    print('Starting {}'.format(id))
    url = f'https://www.testing.com/{id}'

    async with session.get(url) as response:
        return BeautifulSoup(await response.content, 'html.parser')

async def main(id):
    async with aiohttp.ClientSession(trust_env=True) as session:
        soup = await fetch(session, id)
        if 'No record found' in soup.title.text:
            print(id, 'na')


loop = asyncio.get_event_loop()
future = [asyncio.ensure_future(main(id)) for id in ids]


loop.run_until_complete(asyncio.wait(future))

In windows, you have to set the HTTP_PROXY in the commandline (it gets deleted once you close the window): set HTTP_PROXY=http://user:[email protected]:port

In mac, set it in your terminal: export HTTP_PROXY=http://10.20.1.1:8080/

Disclaimer: I believe there are some parts that need to be refined…! but this is what i have for now lol

in the meantime… i will try Scrapy LOL