Site Download Python3

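#!/usr/bin/env python3

import urllib.request
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def crawl(pages, depth=1):
    """Collect URLs reachable from `pages` by following links up to `depth` levels."""
    indexed_url = []  # every URL seen so far: the start pages plus discovered links
    crawled = set()   # pages that have already been fetched
    for _ in range(depth):
        new_pages = []  # links discovered at this level, fetched on the next pass
        for page in pages:
            if page in crawled:
                continue
            crawled.add(page)
            if page not in indexed_url:
                indexed_url.append(page)
            try:
                c = urllib.request.urlopen(page)
            except (OSError, ValueError):  # network/HTTP errors or a malformed URL
                print("Could not open %s" % page)
                continue
            # Name the parser explicitly so bs4 does not have to guess one
            soup = BeautifulSoup(c.read(), "html.parser")
            for link in soup("a"):  # find all the sub-links
                if "href" not in link.attrs:
                    continue
                url = urljoin(page, link["href"])
                if "'" in url:
                    continue
                url = url.split("#")[0]  # drop the fragment part
                if url.startswith("http") and url not in indexed_url:
                    indexed_url.append(url)
                    new_pages.append(url)
        pages = new_pages  # a fresh list, so the next level is actually crawled
    return indexed_url


pagelist = ["https://en.wikipedia.org/wiki/Python_%28programming_language%29"]
urls = crawl(pagelist, depth=1)
print(urls)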
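
The link-handling step of the crawler (resolve each href against the page URL, drop the fragment, keep only http links) can be tried without any network access. The sketch below is an illustration, not part of the original script: it swaps BeautifulSoup for the standard library's `html.parser` so it runs with no third-party packages, and the names `LinkCollector`, `extract_links`, and `sample_html`, as well as the example.com URLs, are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL, then strip the #fragment
        url, _fragment = urldefrag(urljoin(self.base_url, href))
        if url.startswith("http") and url not in self.links:
            self.links.append(url)


def extract_links(base_url, html):
    """Return the de-duplicated http(s) links found in `html`."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Made-up page content used only for this demonstration
sample_html = """
<a href="/about#team">About</a>
<a href="https://example.org/docs">Docs</a>
<a href="mailto:someone@example.com">Mail</a>
<a href="/about">About again</a>
"""
print(extract_links("https://example.com/index.html", sample_html))
```

The `mailto:` link is skipped because it does not start with `http`, and the two `/about` links collapse into one entry once the fragment is removed.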