Python Programming Tutorial – 25 – How to Build a Web Crawler (1/3)

Facebook – https://www.facebook.com/TheNewBoston-464114846956315/
GitHub – https://github.com/buckyroberts
Google+ – https://plus.google.com/+BuckyRoberts
LinkedIn – https://www.linkedin.com/in/buckyroberts
reddit – https://www.reddit.com/r/thenewboston/
Support – https://www.patreon.com/thenewboston
thenewboston – https://thenewboston.com/
Twitter – https://twitter.com/bucky_roberts

Comments

Eustace Benson says:

I installed beautifulsoup4 but the code won't run. I get a red underline on my import requests line.

Harsh raj Solanki says:

Hi, my code is executing without any error but I'm not getting any output. Can somebody help?

Joydeep Roychowdhury says:

Yo guys, "request" (without the s) can only be found in urllib, so try "from urllib import request"

sina cengiz says:

Hey Bucky, I think your page doesn't exist anymore.

David says:

I made a web crawler with Beautiful Soup (20 lines). The logic is: harvest links, append links, iterate through the list and append forever. I plan on screening the websites for some things in the morning. I'll post what I do with them on my blog. I just wanted to share this function.

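import time
import urllib.request
from bs4 import BeautifulSoup

print('\n')
print('-' * 25 + 'Web Crawler' + '-' * 25)
print('\n')

places = []

def crawl(url):
    # harvest and store links from the starting page
    html_page = urllib.request.urlopen(url)
    soup = BeautifulSoup(html_page, "lxml")
    for link in soup.findAll('a'):
        places.append(link.get('href'))

    # harvest and iterate forever (the list keeps growing while it is walked)
    while 1:
        for a in places:
            try:
                html2 = urllib.request.urlopen(a)
                soup2 = BeautifulSoup(html2, "lxml")
                for b in soup2.findAll('a'):
                    # store the href string, not the tag, so it can be opened later
                    places.append(b.get('href'))
                    print(b.get('href'))
            except KeyboardInterrupt:
                print('Crawler Stopped...')
                time.sleep(2)
                print('Shutting Down')
                return
            except Exception:
                # skip links that cannot be opened (missing, relative, or broken)
                continue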

NAREK ANTHONY says:

Hi Bucky,

Your website doesn't exist anymore, and no other website I tried worked. I tried every bit of code I could think of, but the output was always an error. I need some advice here, from you or anyone else who reads this comment. Please don't tell me to try another website, because on every other website I tried, the inspect element view was different from yours.

Thank You, Any advice will be appreciated.

Melvin Vijay says:

What is the difference between
1) urllib.request.urlretrieve()
2) request.urlopen()
3) requests.get()

In three tutorials you have used three different ways to access the URL or webpage. Why can't we use the same request each time?
Thank you.
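
One way to see the difference, as a minimal sketch: urllib.request.urlopen returns a response object that you read bytes from, while urllib.request.urlretrieve copies the resource into a file on disk. requests.get comes from the separate third-party Requests package and returns a response whose .text attribute holds the decoded page. The data: URL and the file name below are placeholders (assumptions, not from the video), chosen so the sketch runs without a network connection.

```python
import os
import tempfile
import urllib.request

# Placeholder URL (an assumption): a data: URL keeps the sketch offline.
URL = "data:text/plain,hello%20crawler"

# 1) urlopen: returns a file-like response; you read and decode the bytes yourself.
with urllib.request.urlopen(URL) as response:
    body_from_urlopen = response.read().decode("utf-8")

# 2) urlretrieve: saves the resource to a local file and returns that file's path.
target = os.path.join(tempfile.mkdtemp(), "page.txt")  # hypothetical file name
saved_path, _headers = urllib.request.urlretrieve(URL, target)
with open(saved_path, encoding="utf-8") as saved_file:
    body_from_urlretrieve = saved_file.read()

print(body_from_urlopen)
print(body_from_urlretrieve)
```

For a crawler that only needs the page text in memory, urlopen (or requests.get) is usually the more natural fit; urlretrieve suits downloading files.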

Salim KHALIL says:

if you need to build your own crawler refer to this paper
https://doi.org/10.1016/j.softx.2017.04.004

Trust Oppa says:

I downloaded beautifulsoup4 but it just won't show up when I type the import. Does anyone else have the same problem?
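
A possible cause, though not the only one, is that the package was installed for a different Python interpreter than the one running the script. The sketch below is a quick check using only the standard library; the helper name is_installed is made up for illustration.

```python
import importlib.util
import sys

def is_installed(module_name):
    """Return True if the current interpreter can find the named module."""
    return importlib.util.find_spec(module_name) is not None

# Shows which Python is running, so it can be compared with the one pip used.
print("Interpreter:", sys.executable)
for name in ("bs4", "requests"):
    print(name, "found" if is_installed(name) else "missing")
```

If a module is reported missing, installing it with the same interpreter (python -m pip install ...) is one way to keep the two in sync.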

Marieke Venselaar says:

Question: at 3:26 Bucky right-clicks, gets a menu and selects view page source. My computer does not do that. BeautifulSoup4 is installed on my computer, and when I copy-paste Dark Seid's code (below in one of the other replies) everything works.

My question basically is: how can I view a page's source code? (Right-clicking does not work.)

Hopefully somebody can help me! thanks in advance!

Martinez Darlene says:

Hey ,you can try the best web Crawler–ContentBomb

https://plus.google.com/115670926276317481719/posts/5348DZH6X8R

Da-mel Melton says:

how did you upload and create your website ?

Alan Lee says:

What is BeautifulSoup again?

dryaser the best says:

When I go to Settings / Interpreter and add a new one, it gives me a blank panel with no options to choose from. What's wrong?

Elyria says:

I’ve been lied to!!! THE CAKE IS A LIE!!! There is no buckysroom…

Bhageerath Bogi says:

error loading package list:pypi.python.org – please help

TakeMyBacon says:

When I try to import BeautifulSoup, this error message comes up. Here is my code:
import requests
from bs4 import BeautifulSoup

The error message:

Traceback (most recent call last):
  File "C:\Users\Shaan\Desktop\cps.py", line 2, in <module>
    from bs4 import BeautifulSoup
  File "C:\Users\Shaan\AppData\Local\Programs\Python\Python35-32\lib\site-packages\bs4\__init__.py", line 30, in <module>
    from .builder import builder_registry, ParserRejectedMarkup
  File "C:\Users\Shaan\AppData\Local\Programs\Python\Python35-32\lib\site-packages\bs4\builder\__init__.py", line 308, in <module>
    from . import _htmlparser
  File "C:\Users\Shaan\AppData\Local\Programs\Python\Python35-32\lib\site-packages\bs4\builder\_htmlparser.py", line 7, in <module>
    from html.parser import (
ImportError: cannot import name 'HTMLParseError'

Do I have to reinstall bs4?

Romulo Alvarez Garcia says:

Could you please tell me how I can get that interface to program in Python?

siddharth yadav says:

Shouldn't the next line after the while loop be "page += 1"? That would increase the page value by 1 up to the max page value, and then we could put the URL line after this page += 1 line?
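
A minimal sketch of the loop in question, assuming a hypothetical base URL and a made-up helper name page_urls: the URL is built from the current page value first, and page += 1 comes at the end of the loop body so the next pass uses the next page. Without the increment, the while condition never becomes false.

```python
def page_urls(base_url, max_pages):
    """Build the list of page URLs a crawler would visit, page 1 through max_pages."""
    urls = []
    page = 1
    while page <= max_pages:
        urls.append(base_url + str(page))  # use the current page number first
        page += 1                          # then advance, or the loop never ends
    return urls

# The base URL is a placeholder, not the site from the video.
print(page_urls("https://example.com/trade?page=", 3))
```

Moving page += 1 above the URL line would skip page 1 and request one page past max_pages.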

Elder Root says:

Haven’t finished watching your series yet, but you are using pycharm. So I like you already.

malla 8848 says:

How can we know what we need to import to do a certain thing, and also how to use it?

The Millennial Sage says:

question, do i need pycharm to use the beautiful soup module?

Goad Said says:

1. Does anyone know of an acceptable website to crawl? The classes on say, ebay, are not as self-evident as “class = item” and/or they are blocking me from crawling them

2. Is, say, "max_pages" a built-in parameter? Does Python know what you mean without defining it further? Does it think page 20 is "max_pages", or 50? I had this same question a few tutorials back with "csv_url" when writing the reader: how does the program know which CSV file containing the URL you want to open? We passed "csv_url" into the function without ever saying csv_url = goog.fgf.csv etc. Did it just automatically assume you wanted to open the link above the user-created function because it was there?
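
On the second question, a short sketch may help: max_pages is not built in. It is only a parameter name, and it takes whatever value is passed when the function is called. The function name crawl_report is made up for illustration.

```python
def crawl_report(max_pages):
    # max_pages has no value of its own; it is bound to the caller's argument
    return "will crawl pages 1 through " + str(max_pages)

print(crawl_report(2))    # inside this call, max_pages is 2
print(crawl_report(50))   # inside this call, max_pages is 50
```

The same applies to csv_url: the function opens whichever URL is passed in at the call site, not a link that merely happens to sit above the function definition.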

Eustace Benson says:

someone please help i just started learning python

Advait Kulkarni says:

How can I get the beautifulsoup4 package in the Atom editor?

suryathejareddy duvvuri says:

Can someone help me with how to download modules from the internet? My PyCharm is not downloading modules.
Please, any help is appreciated.

aggelosQQ says:

+thenewboston you should, maybe, give us alternative websites that are pretty much the same as your trade page since it doesn’t exist anymore, just to make it easier for others and make them focus on the important things rather than looking for a website.

Asymptote says:

If you can't install beautifulsoup4, try running PyCharm as an administrator in Windows.

Belouz Tarek says:

Hello,
thank you for the course. Please, how can I count the number of tags in an HTML page, like (,,,,,… etc.)?
Thanks
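
One way to count tags, sketched with the standard library's html.parser so it runs without extra installs (with BeautifulSoup, soup.find_all(True) returns every tag and could be counted the same way). The class name TagCounter, the helper count_tags, and the sample HTML are made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Counts how many times each opening tag appears in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1

def count_tags(html):
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts

sample = "<html><body><a href='/one'>1</a><a href='/two'>2</a><p>text</p></body></html>"
counts = count_tags(sample)
print(counts)
print("total tags:", sum(counts.values()))
```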

Nataly RAW says:

can multiple threads make multiple http requests at the same time or do these connections interfere with each other?
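
Threads can each issue their own request at the same time; separate requests do not share state unless the code makes them. A minimal sketch with a thread pool follows. The data: URLs are placeholders (an assumption) standing in for real pages so the example runs offline, and the helper name fetch is made up.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

# Placeholder data: URLs stand in for real pages so this runs without a network.
URLS = ["data:text/plain,page%20" + str(n) for n in range(1, 5)]

def fetch(url):
    """Open one URL and return the decoded body."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

# Each worker thread issues its own request; map() returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    bodies = list(pool.map(fetch, URLS))

print(bodies)
```

Against a real site, it is worth keeping max_workers small so the crawler does not flood the server.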

Zhihua Zhou says:

9 pages…

Manik Mahajan says:

beautiful soup cannot be installed?
help

arifattal says:

i cant seem to add packages on pycharm. ive got 3.0.3, older ios.
and dont have that + button to add packages off the web
any help guys?

ninad virkar says:

I cant find you webpage buckysroom/trade

Cangri says:

Hey Bucky! What do I do when at the moment of selecting another page from the website, the url is not altered? Do you understand the question?

DBZGODZ says:

what about game bots is it decently hard to create them or no?

Lee Annabel says:

*BotChief* can create any online web bots without programming or coding. https://plus.google.com/+LewRowland/posts/KCguwM62qsC

Damien Bertrand says:

etsy for anyone else looking for a good url

韩鹏 says:

The text is always small. Can you enlarge it?

Damien Bertrand says:

do you have any idea how fucking hard it is to find a website with page= at the end of the url bucky fix this video for f sake

lol k says:

I can’t find the website what should I do

Salim KHALIL says:

you can also use Rcrawler package, it’s much easier and can crawl & scrape all web pages of a website automatically https://CRAN.R-project.org/package=Rcrawler

billa rauf says:

http://oceanofgames.com/page/2/
for all the people who are asking nicely for a site
and all those fuckers who are abusing

Gaming Bros says:

Need a url to crawl

Krishan Bhadana says:

youtube needs people like you, seriously there are very few people who are sharing their knowledge with the world in such a beautiful manner

SK says:

Will the web crawler work on the Community edition of the PyCharm IDE?

vinod kumar says:

Do you have Django tutorials? If you have, why don't you post them?

lich tran says:

install BeautifulSoup and Requests at windows command:

>python -m pip install BeautifulSoup4
>python -m pip install Requests

Celtic Tiger says:

Tried a few changes still wont work 28/04/17

import requests
from bs4 import BeautifulSoup

def motor_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://www.donedeal.ie/motorbikes" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('a', {'class': 'body_title'}):
            href = link.get('href')
            print(href)
        page += 1

motor_spider(2)
