
Hands-on web crawling with Python


By Kerry Parker

Translated by Tian Xiaoning

Proofread by Nanya Ding

This article is about 2,900 words; recommended reading time 10 minutes.

This tutorial teaches you to crawl web pages for information, using the example of collecting data from the top 100 companies on Fast Track.

As a data scientist, one of the first things I do in my job is web data collection. Using code to collect data from websites was a completely foreign concept to me at the time, but it was one of the most logical and easily accessible sources of data. After a few attempts, web crawling has become second nature to me and is one of the skills I use almost daily.

In this tutorial, I will present a simple example of how to scrape a website, collecting data from Fast Track on the top 100 companies of 2018.

Fast Track:

Automating this process using a web crawler saves time by avoiding manual data collection and also allows all data to be placed in one structured file.

This is a quick example of implementing a simple web scraper in Python; you can find the full code presented in this tutorial on GitHub.

GitHub link.

https://github.com/kaparker/tutorials/blob/master/pythonscraper/websitescrapefasttrack.py

The following is a short overview of this tutorial on web scraping with Python:

Connect to the web page

Parse the html with BeautifulSoup

Loop through the soup object to find elements

Perform some simple data cleaning

Write the data to csv

Getting started

The first question to ask before starting any Python project is: what libraries do I need?

For web scraping, there are a number of different libraries to consider, including:

Beautiful Soup

Requests

Scrapy

Selenium

In this example we use Beautiful Soup. You can install Beautiful Soup using the Python package manager pip:

pip install BeautifulSoup4

With these libraries installed, let's get started!

Check the web page

To know which elements you need to locate in your Python code, you first need to examine the page.

To collect data from Tech Track Top 100 companies, examine the page by right-clicking on the element of interest and then selecting Inspect. This will open the html code in which we can see the elements contained in each field.

Tech Track Top 100 companies link.

http://www.fasttrack.co.uk/league-tables/tech-track-100/league-table/

Right-click on the element of interest and select "Inspect" to display the html element.

Since the data is stored in a table, it can be fetched directly with just a few lines of code. If you want to practice web scraping, this is a great example and a good place to start, but remember: it's not always this simple!

All 100 results are contained in the rows (tr elements) of the table, and these are all visible on one page. This is not always the case; when results span multiple pages, you may need to change the number of results displayed per page, or traverse all pages to collect all the information.

The League Table page shows a table containing 100 results. It's easy to see a pattern in the html when examining the page. The results are contained in the rows of the table.

These repeated rows keep our code minimal: we can use a loop in Python to find the data and write it to the file!

Note: another check worth doing is whether the site makes an HTTP GET request that already returns the results as a structured response (e.g. in JSON or XML format). You can check this in the Network tab of the inspection tool, usually under the XHR tab. After refreshing the page, it will display the requests as they load, and if the response has a structured format, it is often easier to explore it with a REST client (such as Insomnia).

After refreshing the page, the Network tab of the Page Inspection tool

Parsing web html with Beautiful Soup

Now that you've looked at the structure of the html and are familiar with what you'll be crawling, it's time to start using Python!

The first step is to import the libraries we will use for the web scraper. We have already discussed BeautifulSoup above, which helps us handle the html. The next library we import is urllib, which connects to the web page. Finally, we will write the output to a csv file, so we also need to import the csv library. As an alternative, the json library could be used here.

# import libraries

from bs4 import BeautifulSoup

import urllib.request

import csv

The next step is to define the URL you are crawling. As mentioned in the previous section, this page displays all results on one page, so the full url in the address bar is given here.

# specify the url

urlpage = 'http://www.fasttrack.co.uk/league-tables/tech-track-100/league-table/'

Then we establish the connection to the web page and parse the html with BeautifulSoup, storing the parsed object in the variable 'soup':

# query the website and return the html to the variable 'page'

page = urllib.request.urlopen(urlpage)

# parse the html using beautiful soup and store in variable 'soup'

soup = BeautifulSoup(page, 'html.parser')

We can print the soup variable at this stage, and it should return the fully parsed html of the page we requested.

print(soup)

If there is an error or the variable is empty, the request may not have succeeded. You can use the urllib.error module to implement error handling at this point.
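For example, a minimal sketch of that error handling, assuming we just want to report the failure rather than retry, could be:

# handle failed requests before parsing
import urllib.error

try:
    page = urllib.request.urlopen(urlpage)
    soup = BeautifulSoup(page, 'html.parser')
except urllib.error.HTTPError as e:
    # the server responded with an error status, e.g. 404 or 500
    print('HTTPError:', e.code)
except urllib.error.URLError as e:
    # the request never reached the server, e.g. no network or a bad hostname
    print('URLError:', e.reason)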

Searching for html elements

Since all the results are contained in a table, we can use the find method to search the soup object for the table. We can then use the find_all method to find each row in the table.

If we print the number of rows, we should get 101: the 100 result rows plus the header row.

# find results within table

table = soup.find('table', attrs={'class': 'tableSorter'})

results = table.find_all('tr')

print('Number of results', len(results))

Therefore, we can loop through the results to collect data.

Printing the first two rows of the soup object, we can see that each row has the following structure:

Header row: Rank | Company | Location | Year end | Annual sales rise over 3 years | Latest sales £000s | Staff | Comment

First data row: 1 | Wonderbly (Personalised children's books) | East London | Apr-17 | 294.27% | *25,860 | 80 | Has sold nearly 3m customisable children's books in 200 countries

There are 8 columns in the table: Rank, Company, Location, Year End, Annual Sales Rise, Latest Sales, Staff and Comments, all of which are data of interest that we can save.

The structure is consistent across all rows of the page (which may not always be the case on other sites!), so we can again use the find_all method to assign each column to a variable by searching for the relevant element, and then write the data to csv or JSON.

Loop through elements and save variables

In Python, it is useful to append the results to a list and then write the data to a file. We should declare the list and set up the csv headers before the loop, as follows.

# create and write headers to a list

rows = []

rows.append(['Rank', 'Company Name', 'Webpage', 'Description', 'Location', 'Year end', 'Annual sales rise over 3 years', 'Sales £000s', 'Staff', 'Comments'])

print(rows)

This will print out the first line we add to the list containing the headings.

You may notice that there are some extra fields, Webpage and Description, which are not column names in the table; but if you look closely at the html we printed from the soup variable above, the second row contains more than just the company name. We can use some further extraction to get this additional information.

The next step is to loop through the results, process the data and append it to rows that can be written to csv.

Loop over the results:

# loop over results
for result in results:
    # find all columns per result
    data = result.find_all('td')
    # check that columns have data
    if len(data) == 0:
        continue

Since the first row of the table contains only the column headings, we can skip this result, as shown above. The header row does not contain any td elements, so when searching for td elements nothing is returned. We can then make sure that only results containing data are processed by requiring the length of data to be non-zero.

Then we can start processing the data and saving it to the variables.

# write columns to variables

rank = data[0].getText()

company = data[1].getText()

location = data[2].getText()

yearend = data[3].getText()

salesrise = data[4].getText()

sales = data[5].getText()

staff = data[6].getText()

comments = data[7].getText()

The above just gets the text from each column and saves it to a variable. However, some of this data needs further cleaning to remove unwanted characters or to extract more information.

Data cleaning

If we print out the variable company, the text contains not only the company name, but also a description. We then print sales, which contains unwanted characters such as footnote symbols that are best deleted.

print('Company is', company)

# Company is WonderblyPersonalised children's books

print('Sales', sales)

# Sales *25,860

We want to split company into the company name and the description, which we can do in a few lines of code. Looking at the html again, for this column there is a span element that contains only the company name. There is also a link in this column to another page on the site that has more detailed information about the company; we will use it later!

To split company into two fields, we can use the find method to save the company-name element, and then use strip or replace to remove the company name from the company variable, leaving only the description.

To remove unwanted characters from the sales, we can use the strip and replace methods again!

# extract description from the name

companyname = data[1].find('span', attrs={'class':'company-name'}).getText()

description = company.replace(companyname, '')

# remove unwanted characters

sales = sales.strip('*').strip('†').replace(',','')

The last variable we want to save is the company website. As mentioned above, the second column contains a link to another page that has an overview of each company. Each company page has its own table, which most of the time contains the company website.

Check the url element on the company page

To grab the url from each table and save it as a variable, we need to use the same steps as above.

Find the element on the Fast Track website that contains the company page URL

Make a request to each company page URL

Parse the html with BeautifulSoup

Find the elements of interest

Looking at some of the company pages, as shown in the screenshot above, the URL is located in the last row of the table, so we can search for elements within the last row.

It is also possible that the company website is not displayed, so we can use a try/except block in case the URL is not found.
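A minimal sketch of these steps might look like the following. It assumes the company column (data[1]) contains an a tag linking to the company page, and that the company website appears in an a tag in the last row of the table on that page; in the full script these lines sit inside the loop alongside the other variables.

# follow the link in the company column to the company page
url = data[1].find('a').get('href')
companypage = urllib.request.urlopen(url)
companysoup = BeautifulSoup(companypage, 'html.parser')

# the company website is assumed to be an 'a' tag in the last row of the table
try:
    tablerow = companysoup.find('table').find_all('tr')[-1]
    webpage = tablerow.find('a').get('href')
except (AttributeError, IndexError):
    # no website listed for this company
    webpage = None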

Once we have saved all the data to variables, we can add each result to the list rows in a loop.

# write each result to rows

rows.append([rank, company, webpage, description, location, yearend, salesrise, sales, staff, comments])

print(rows)

Then you can try printing the variable outside the loop and checking it meets your expectations before writing it to a file!

Write to output file

If you want to save this data for analysis, it can be done very simply in Python from our list.

# Create csv and write rows to output file
with open('techtrack100.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerows(rows)

When you run the Python script, an output file with 100 lines of results is generated, which you can view in more detail!
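As mentioned above, the json library can be used instead of csv. A minimal sketch, assuming we reuse the header row in rows as the keys, might be:

# alternative output: write the same results as JSON, using the header row as keys
import json

headers = rows[0]
records = [dict(zip(headers, row)) for row in rows[1:]]

with open('techtrack100.json', 'w') as f_output:
    json.dump(records, f_output, indent=2)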

Endnote

This is my first tutorial, so let me know if you have any questions or comments or if something isn't clear!

Web Development

https://towardsdatascience.com/tagged/web-development?source=post

Python

https://towardsdatascience.com/tagged/python?source=post

Web Scraping

https://towardsdatascience.com/tagged/web-scraping?source=post

Data Science

https://towardsdatascience.com/tagged/data-science?source=post

Programming

https://towardsdatascience.com/tagged/programming?source=post

Original title.

Data Science Skills: Web scraping using python

https://towardsdatascience.com/data-science-skills-web-scraping-using-python-d1a85ef607ed

Translator's Profile

Tian Xiaoning is an expert in quality management and an internationally certified Lean Six Sigma Black Belt with 19 years of experience in the field; a software engineering expert with a CMMI ATM certificate who has led his company through a CMMI Level 5 assessment; proficient in the ISO9000 and ISO27000 systems, and for many years the company's lead auditor for quality and information security, auditing more than 50 projects or departments each year. He holds a PMP certificate and is the company's internal trainer for project management, with hands-on experience in project management and system development.


