Python Programming Tutorial – 27 – How to Build a Web Crawler (3/3)

Facebook – https://www.facebook.com/TheNewBoston-464114846956315/
GitHub – https://github.com/buckyroberts
Google+ – https://plus.google.com/+BuckyRoberts
LinkedIn – https://www.linkedin.com/in/buckyroberts
reddit – https://www.reddit.com/r/thenewboston/
Support – https://www.patreon.com/thenewboston
thenewboston – https://thenewboston.com/
Twitter – https://twitter.com/bucky_roberts

Comments

conjuring official says:

It gives me a big error when I try this code with the OLX website, because buckysroom is no longer available.
Please help.

TheNidoking says:

If you are still getting errors regarding Connection, you need to do something called exception handling.

try:
    source_code = requests.get(url)   # whichever request was raising the ConnectionError
except requests.exceptions.ConnectionError:
    pass
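
For context, a minimal sketch of how that exception handling can wrap the request inside a crawler; the URL is a placeholder and the fetch helper is only for illustration:

import requests
from bs4 import BeautifulSoup

def fetch(url):
    # return the page text, or None if the connection fails, so the crawl can continue
    try:
        return requests.get(url).text
    except requests.exceptions.ConnectionError:
        return None

html = fetch("https://example.com/listings")   # placeholder URL
if html is not None:
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.findAll('a'):
        print(link.get('href'))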
NassesSchnitzel says:

Where is the source code?

New Bee says:

import requests
from bs4 import BeautifulSoup

def download_from(url):
    page_source = BeautifulSoup(requests.get(url).text, "html.parser")

    for line in page_source.findAll('a', {'class': 'product-name'}):
        href = line.get('href')
        print(href)

download_from("https://www.eshop.dz/144-reception-satellite")

Kiarash Kashani says:

Why does my program always output this?!

C:\Users\DIGISUB\PycharmProjects\untitled2\venv\Scripts\python.exe C:/Users/DIGISUB/PycharmProjects/untitled2/MAIN.py
C:\Users\DIGISUB\PycharmProjects\untitled2\venv\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 32 of the file C:/Users/DIGISUB/PycharmProjects/untitled2/MAIN.py. To get rid of this warning, change code that looks like this:

BeautifulSoup(YOUR_MARKUP})

to this:

BeautifulSoup(YOUR_MARKUP, "html.parser")

markup_type=markup_type))

Process finished with exit code 0
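
That output is only a warning, not an error: bs4 is asking you to name the parser explicitly on the BeautifulSoup call it points to (line 32 of your MAIN.py). A minimal before/after sketch, with a stand-in variable name since the actual one on that line isn't shown:

from bs4 import BeautifulSoup

plain_text = "<html><body><a href='/x'>example</a></body></html>"
# soup = BeautifulSoup(plain_text)                 # triggers the UserWarning
soup = BeautifulSoup(plain_text, "html.parser")    # naming the parser silences it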

AnnesoJ says:

Hi everyone,

I am new to BeautifulSoup.
I managed to get the price of a chair on IKEA's website easily.

Now I want to get the number of players for a board game, but I can't manage to do that.
I asked my question here:
https://stackoverflow.com/questions/45478237/beautifulsoup-get-information-from-website-parent-child-issue

If anyone can help that would be amazing :-)

sammywammy says:

Does anyone know if Craigslist has blocked web crawling? I have used it and cannot get the item data anymore.

MrPaceTv says:

So I was using this site http://books.toscrape.com/catalogue/category/books_1/page-1.html
but my print(title), where title = link.string, had 2 issues:
1. It was printing lots of spaces.
2. It was printing None a lot.
So I modified my for loop to this:

if not title:              # skip empty titles; an empty string (or None) is falsy
    continue               # move on to the next link
else:
    print(title.strip())   # strip() removes spaces on both sides; lstrip()/rstrip() trim only the left/right
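
For anyone practicing on the same site, a minimal sketch of the whole loop with that check in place; it assumes the page still serves static HTML and uses the page-N.html URL pattern from the comment above:

import requests
from bs4 import BeautifulSoup

def crawl_books(max_pages):
    page = 1
    while page <= max_pages:
        url = "http://books.toscrape.com/catalogue/category/books_1/page-" + str(page) + ".html"
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        for link in soup.findAll('a'):
            title = link.string
            if not title:              # anchors that only wrap an image have no string
                continue
            print(title.strip())       # drop the surrounding whitespace
        page += 1

crawl_books(1)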

atur94 says:

I wrote a simple Reddit crawler. If you want, you can find it here: https://github.com/atur94/SimpleRedditCrawler

Java-Tutorial says:

Very informative …

MorbanJunior says:

Can you please update the How to Build a Web Crawler tutorials (1, 2, 3)? Those tutorials are from three years ago.

Laksh Sethi says:

I can't seem to open his forum page, the one he says has all the code. Can anyone give me the link, please?

Ravi Srinivasan says:

Getting the title and description of people in TheNewBoston:
import requests
from bs4 import BeautifulSoup

def boston_spider(max_pages):
    page = 1
    while page < max_pages:
        url = "https://thenewboston.com/search.php?type=1&sort=pop&page=" + str(page)
        source = requests.get(url)
        plaintext = source.text
        soup = BeautifulSoup(plaintext, "html.parser")
        for link in soup.findAll('a', {'class': 'user-name'}):
            href = "https://thenewboston.com/" + link.get('href')
            get_single_item_data(href)
            # title = link.string
        page += 1

def get_single_item_data(item_url):
    source = requests.get(item_url)
    plaintext = source.text
    soup = BeautifulSoup(plaintext, "html.parser")
    for link in soup.findAll('p', {'id': 'page-description'}):
        desc = link.string
    for link in soup.findAll('h1', {'id': 'page-title-span'}):
        title = link.string
    print(str(title) + " is about " + str(desc))

boston_spider(2)

zain mohammad says:

How did you use href as the URL to get the item names in this third tutorial? :(

Gera Sanz says:

(English is not my native language.) I just loved these tutorials and the way Bucky explains everything. I know they're quite old, and Bucky's page exists no more, but that pushed me to practice on another website, which gave me more challenges and forced me to do more research, for obvious reasons. Now I have my own web crawler adjusted to a different web page with completely different settings, like bringing back not the links but the text inside a child element matching a specific condition, and so on. I'm so excited right now. Maybe it's not much for most of you, but for me, not being anywhere near a programmer (I studied Economics), I feel like a hacker.

Lakshya Sethi says:

Amazing tutorial! I got an assignment to make a web crawler and create tf-idf on the results. This gave me a headstart! Thank you so much!

S Sanyus says:

Hi, how can I download the images?

Subhankar Das says:

Bucky, I can't find your website.

Cheese Cake says:

Please show us how to do it with sets.

Austin Cain says:

LOL, Did anyone do this to the titles on “thenewboston”? I saw it say, “I am programmer, I have no life”. I thought that was hilarious.

P.S 2005 says:

Please help me with this and tell me why it is not working:

import requests
from bs4 import BeautifulSoup

def youtube_video_search(video):
    url = 'https://www.youtube.com/results?search_query=' + str(video)   # note the '=' after search_query
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    # both attributes belong in a single dict; passing two dicts puts the second into the wrong argument
    for the_video in soup.find_all('a', {'id': 'video-title', 'class': 'yt-simple-endpoint style-scope ytd-video-renderer'}):
        title = the_video.string
        href = the_video.get('href')
        print(title)
        print(href)

youtube_video_search('python+coading')

result —

F:\python\Python\python.exe F:/python/PycharmProjects/Project1/My_functions.py

Process finished with exit code 0
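
The clean exit with no output almost certainly means the loop found nothing to print: YouTube builds its results page with JavaScript, so the static HTML that requests downloads likely does not contain the <a id="video-title"> elements at all. A quick way to check that assumption:

import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.youtube.com/results?search_query=python").text
soup = BeautifulSoup(html, "html.parser")
print(len(soup.find_all('a', {'id': 'video-title'})))  # expected to print 0 for the static page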

New Bee says:

Practice makes perfect, guys.
Bucky, thank you SO MUCH, honestly you're the best.

Guys, I used this website to practice: https://www.eshop.dz/144-reception-satellite

Nitin Dogra says:

I am getting an error with your code in PyCharm:
urllib.error.HTTPError: HTTP Error 401: Unauthorized
What should I do?
Help…

Jon Poulter says:

Hey Bucky – firstly, fantastic, clear and easy-to-follow tutorials. I’ve just stumbled across them but thanks so much already!
However, something you mentioned at the end of this one has tripped me up and I’ve not been able to dig up a solution. I know these videos are fairly old now, but how would you go about, in this example, adding the urls into a set to quickly and simply strip out the duplicates?
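
Neither this nor the earlier "show us how to do it with sets" request gets an answer in the thread, but a minimal sketch would be to collect the hrefs into a set before visiting them; the URL and class name below are placeholders:

import requests
from bs4 import BeautifulSoup

def gather_links(max_pages):
    found = set()                       # a set silently drops duplicates
    page = 1
    while page <= max_pages:
        url = "https://example.com/search?page=" + str(page)       # placeholder URL
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        for link in soup.findAll('a', {'class': 'item-name'}):      # placeholder class
            found.add(link.get('href'))                             # add() instead of append()
        page += 1
    return found

for href in gather_links(2):
    print(href)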

BharatPutra says:

I am getting a warning.

A G says:

That was a really good job. But how about images?

Harshith Chukka says:

I don't understand how you were able to call the second function when it is defined afterwards.
For me it shows get_single_item_data(href) – NAME NOT DEFINED.
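
A tiny illustration of why the forward call works: Python only looks the name up when the call actually runs, and by then both def statements have executed. The names here are made up for the example:

def spider():
    get_single_item_data("some-url")   # not resolved until spider() is called

def get_single_item_data(href):
    print("visiting", href)

spider()   # works: get_single_item_data exists by the time this line runs
# calling spider() above the second def would raise NameError instead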

Xavier van Gorp says:

Has anyone written the code for the web crawler to extract its own code from the website?

Yasmin Khalil says:

I have a question: since the two for loops come one after the other, shouldn't the results of the first come before the results of the second? I mean, shouldn't all the item names come first and then the URLs? Why is each item name followed by the links on that page?
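
The interleaving happens because the second function is called from inside the first loop, so its output appears between that loop's iterations rather than after the whole loop finishes. A stripped-down illustration with made-up names:

def crawl():
    for href in ["page-1", "page-2"]:
        print("link:", href)       # first loop's output
        get_item(href)             # called once per iteration, so its output follows immediately

def get_item(href):
    print("  item from", href)

crawl()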

Syed Muhammad Akbar says:

No idea what happened in the last three videos, but it was cool! Time to get down to business.

shipika singh says:

Hey Boston, thanks for this tutorial.
Can you please tell me how I can create a huge document dataset using this, one that also contains duplicates of most of the links? I want to create a dataset with duplicate documents, distributed into categories, for running through my model (with approximately 2000 entries). Reply ASAP.

Filip Dimitrievski says:

Here's an interesting idea…

import requests
from bs4 import BeautifulSoup
import random
import urllib.request

def download_image(url):
    name = random.randrange(1, 1000)
    full_name = str(name) + ".jpg"
    urllib.request.urlretrieve(url, full_name)

def download_all_photos(max_pages):
    page = 1
    while page <= max_pages:
        url = urll + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for image in soup.findAll('img', {'class': 'img'}):
            download_image(image['src'])
        page += 1   # without this the loop never advances

urll = "SOME EBAY SEARCH URL WITHOUT THE LAST NUMBER"
download_all_photos(1)

ZeemanLive says:

For those of you confused by this tutorial, I advise watching other web crawler videos too; that helps. Do stick to this course, but if you can't understand a topic from Bucky, learn the same topic from others and then come back here; you're much more likely to understand it. That works for me.

ToberiusCricket says:

Could you crawl a web page and add all results to a database within one single Python script?
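
One way: a minimal sketch of a single script that crawls a page and writes the results straight into SQLite (standard library). The URL and class name are placeholders:

import sqlite3
import requests
from bs4 import BeautifulSoup

conn = sqlite3.connect("crawl.db")
conn.execute("CREATE TABLE IF NOT EXISTS items (title TEXT, href TEXT)")

soup = BeautifulSoup(requests.get("https://example.com/listings").text, "html.parser")  # placeholder URL
for link in soup.findAll('a', {'class': 'item-name'}):                                  # placeholder class
    conn.execute("INSERT INTO items VALUES (?, ?)", (link.string, link.get('href')))

conn.commit()
conn.close()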

syds graphics says:

By far, and I've searched the ocean bottoms and mountain tops. Did I say by far? Ummmm, just wanna say thanks; actually I wanna say a bit more. This 23-27 yr old age bracket has literally taught me how to build a website: (Tyler) using WordPress step by step, and I'm learning more in three days with you than… ever. Finally, Chance the Rapper initially got my attention by giving all of his music away free. WHAT? Look at you guys, not freakin' asking for money, just sayin' "check this out, let me show you"… I'm payin' it forward, Bucky. I print T-shirts, like the best on the planet… so my plan is to finish my website, but with this integration of Python functions… huh? I would love to speak with you some day. I would love to share something with you as an appreciation…

Nicolas dos santos says:

Working code using a craigslist page

import requests
from bs4 import BeautifulSoup
import operator

def start(url):

    # create an empty list
    word_list = []
    # connect to the url and retrieve the page as plain text
    source_code = requests.get(url).text
    # create a soup object
    soup = BeautifulSoup(source_code, "html.parser")

    # go through all the result links and take only the titles
    for post_text in soup.findAll('a', {'class': 'result-title hdrlnk'}):
        # remove the HTML and store only the text in the variable content
        content = post_text.string
        # set all words to lower case, split them at whitespace, and store them in words
        words = content.lower().split()
        # with the words separated we can add them to the previously empty word_list
        for each_word in words:
            word_list.append(each_word)

    # call the method that cleans the symbols out of word_list
    clean_up_list(word_list)

# removes symbols from words
def clean_up_list(word_list):

    # create an empty list where the clean words will be stored
    clean_word_list = []

    # loop through the list and remove symbols
    for word in word_list:
        symbols = r"!@#$%^&*()_+{}[]`~:;',.<>|/"

        # replace every symbol with "" --> nothing
        for i in range(0, len(symbols)):
            word = word.replace(symbols[i], "")

        # only keep the word if something is left after the symbols are removed
        if len(word) > 0:
            print(word)

            # add the word to the clean word list
            clean_word_list.append(word)

    # call the method that builds a dictionary of the clean words and counts how often each shows up
    create_dictionary(clean_word_list)

# builds a dictionary of the clean words and counts how often each shows up
def create_dictionary(clean_word_list):

    # create an empty dictionary
    word_count = {}

    # if the word is already in the dictionary, add 1 to its counter;
    # otherwise set the counter to 1
    for word in clean_word_list:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1

    # go through the dictionary sorted by count and print each entry
    for key, value in sorted(word_count.items(), key=operator.itemgetter(1)):
        print(key, value)

start('https://philadelphia.craigslist.org/search/sys')

Syed Anas says:

Thank Me later :)
import requests
from bs4 import BeautifulSoup
def web_crawler(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://sfbay.craigslist.org/search/sby/cto?s=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('a', {'class': 'hdrlnk'}):
            href = "https://sfbay.craigslist.org" + link.get('href')
            title = link.string
            single_item_data(href)
            # print(href)
            # print(title)
        page += 1

def single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for item_name in soup.findAll('span', {'id': 'titletextonly'}):
        print(item_name.string)

web_crawler(4)

Sameer Ahmar says:

Why can’t I open your forum?
