With 80,000 images, the headache begins. For convenience's sake (and my own clarity, lol), here is the workflow below. Pardon the extra step of first extracting just the links #regret - it is possible to download the images directly with Scrapy (via its built-in ImagesPipeline).
1) Extract links from Wikipedia. This saves all the links into ‘items.csv’ as a tab-delimited CSV, using my spider, ‘myspider.py’.
scrapy runspider myspider.py -o items.csv -t csv
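The actual myspider.py isn't reproduced here, but a bare-bones link-extracting spider could look something like the sketch below. The start URL, CSS selector, and the image_url field name are placeholders, not what the real spider uses.

```python
# Minimal sketch of a link-extracting spider.
# Placeholder start URL and selector - the real myspider.py differs.
import scrapy

class ImageLinkSpider(scrapy.Spider):
    name = "myspider"
    # Placeholder category page - swap in the Wikipedia pages actually crawled
    start_urls = ["https://en.wikipedia.org/wiki/Category:Paintings"]

    def parse(self, response):
        # Yield one item per image link found on the page
        for src in response.css("img::attr(src)").getall():
            yield {"image_url": response.urljoin(src)}
```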
2) Download the pictures from the links, crop and resize them (50x50), and upload to AWS
python3 uploading.py
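The gist of uploading.py is roughly the sketch below: read the links out of items.csv, fetch each image, centre-crop it to a square, resize to 50x50, and push the result to S3. The bucket name, key prefix, and the image_url column are placeholders rather than the script's actual values.

```python
# Rough sketch of the download/crop/resize/upload step.
# Bucket name, key prefix, and CSV column name are placeholders.
import csv
import io

import boto3
import requests
from PIL import Image

BUCKET = "my-image-bucket-50"  # placeholder bucket name
s3 = boto3.client("s3")

with open("items.csv", newline="") as f:
    # items.csv is tab-delimited, as described in step 1
    for i, row in enumerate(csv.DictReader(f, delimiter="\t")):
        url = row["image_url"]  # assumed column name
        resp = requests.get(url, timeout=10)
        img = Image.open(io.BytesIO(resp.content)).convert("RGB")

        # Centre-crop to a square, then resize to 50x50
        side = min(img.size)
        left = (img.width - side) // 2
        top = (img.height - side) // 2
        img = img.crop((left, top, left + side, top + side)).resize((50, 50))

        buf = io.BytesIO()
        img.save(buf, format="JPEG")
        buf.seek(0)
        s3.upload_fileobj(buf, BUCKET, f"images_50/{i}.jpg")
```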
3) Resize the pictures (28x28) and resave them to another bucket in AWS
python3 resize_28_upload.py
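resize_28_upload.py boils down to something like this: list the 50x50 objects, shrink each one to 28x28, and write it into a second bucket. Bucket names and key prefixes are again placeholders.

```python
# Sketch of shrinking the 50x50 images to 28x28 and writing them to a second
# bucket. Bucket names and key prefixes are placeholders.
import io

import boto3
from PIL import Image

s3 = boto3.client("s3")
SRC_BUCKET = "my-image-bucket-50"  # placeholder
DST_BUCKET = "my-image-bucket-28"  # placeholder

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SRC_BUCKET, Prefix="images_50/"):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=SRC_BUCKET, Key=obj["Key"])["Body"].read()
        img = Image.open(io.BytesIO(body)).resize((28, 28))

        buf = io.BytesIO()
        img.save(buf, format="JPEG")
        buf.seek(0)
        key = obj["Key"].replace("images_50/", "images_28/")
        s3.upload_fileobj(buf, DST_BUCKET, key)
```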
4) Save the pictures into an array, pickle it, and store it in AWS
python3 pickle_img_array.py
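pickle_img_array.py collects the 28x28 images into a single numpy array and pickles it back to S3, roughly along these lines. The grayscale conversion, bucket name, and output key are assumptions.

```python
# Sketch of stacking the 28x28 images into one numpy array and pickling it
# to S3. Grayscale conversion, bucket name, and output key are assumptions.
import io
import pickle

import boto3
import numpy as np
from PIL import Image

s3 = boto3.client("s3")
BUCKET = "my-image-bucket-28"  # placeholder

images = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix="images_28/"):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        images.append(np.asarray(Image.open(io.BytesIO(body)).convert("L")))

arr = np.stack(images)  # shape: (n_images, 28, 28)
s3.put_object(Bucket=BUCKET, Key="img_array.pkl", Body=pickle.dumps(arr))
```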
5) Run the test model on AWS
python3 POC_adapted_28_aws.py
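The model itself isn't the point of this post, but for a sense of scale, a proof-of-concept on 28x28 inputs can be as small as the Keras sketch below. The bucket/key, the random placeholder labels, and the architecture are purely illustrative; the real POC_adapted_28_aws.py may look quite different.

```python
# Purely illustrative stand-in for the proof-of-concept: a tiny Keras CNN fed
# the pickled 28x28 array. Bucket/key, placeholder labels, and architecture
# are all assumptions - the real script may differ.
import pickle

import boto3
import numpy as np
import tensorflow as tf

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-image-bucket-28", Key="img_array.pkl")  # placeholders
x = pickle.loads(obj["Body"].read()).astype("float32") / 255.0
x = x.reshape(-1, 28, 28, 1)
y = np.random.randint(0, 2, size=len(x))  # placeholder labels, not real data

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x, y, epochs=3, validation_split=0.1)
```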