Python crawler common library: BeautifulSoup in detail


This is the 16th original article on learning python on a daily basis

After the previous few articles, you probably already know how to crawl quite a few small and medium-sized websites. But some readers say the earlier regular-expression material is hard and they can't master it. Regular expressions are indeed difficult; as the saying goes, if you solve a problem with a regular expression, you now have two problems. So it's normal to struggle with them, and don't worry: besides regex there is another powerful library we can use to parse HTML. Today's topic is that powerful library, BeautifulSoup. That said, regular expressions still deserve plenty of practice.

Since it's a third-party library, we need to install it by running the following at the command line:

pip install beautifulsoup4

Also install the third-party parsing libraries:

pip install lxml
pip install html5lib

If you don't know what they are for, read on.

1.Introduction to related parsing libraries

The officially recommended parser here is lxml, because it is efficient. Everything below is parsed with the lxml parser.

2. Detailed syntax introduction

This article parses the home page of Douban Books: https://book.douban.com/

1) Create bs objects

from bs4 import BeautifulSoup
import requests
response = requests.get('https://book.douban.com/').text
# print(response)
 # Create the bs object
soup = BeautifulSoup(response, 'lxml')  # Use the lxml parser
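If you'd rather experiment without hitting the network, the same syntax works on an inline HTML string. A minimal sketch (the snippet only mimics the structure of the Douban page, and html.parser, the built-in parser, stands in for lxml in case it isn't installed):

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the downloaded page, so the syntax below can be
# tried without a network request
html = '<ul><li class=""><a href="https://www.douban.com">Douban</a></li></ul>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.li.a.string)  # Douban
```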

2) Get relevant tags

Tags.

<a data-moreurl-dict='{"from":"top-nav-click-main","uid":"0"}' href="https://www.douban.com" target="_blank"> Douban, PRC social networking website</a>

In the example above, a is the tag name. Simply put, the first word inside the <> brackets, as in <a></a>, is the tag name.

 # Get the tag
 print(soup.li) # This only gets the first li tag
 # Results
<li class="">
<a data-moreurl-dict='{"from":"top-nav-click-main","uid":"0"}' href="https://www.douban.com" target="_blank"> Douban, PRC social networking website</a>
</li>

3) Get the name and content of the tag

Name and content of a tag.

<a > Douban, PRC social networking website</a>

As mentioned above, a is the name of the tag, and the text between the opening and closing tags is the tag's content; here, "Douban, PRC social networking website" is the content of that tag.

 # Get the label name
print(soup.li.name)
 # Get the tag's text content
 print(soup.li.string) # This only works if the tag has no child tags; otherwise it returns None
 # Results
li
None

Since this li tag has a child tag inside, its text content is None.

Here is how to get that text content:

 # Get the tag nested inside the tag
print(soup.li.a)
 print(soup.li.a.string) # This tag has no child tags, so its content can be fetched
 # Results
<a data-moreurl-dict='{"from":"top-nav-click-main","uid":"0"}' href="https://www.douban.com" target="_blank"> Douban, PRC social networking website</a>
 Douban, PRC social networking website

4) Get the tag attributes, there are two ways

Tag attributes.

<a href="https://www.douban.com" target="_blank"> Douban, PRC social networking website</a>

Attributes can be understood simply as what sits inside the <> brackets after the tag name, written as name=value pairs. So href above is the attribute name, and what is to the right of the equals sign is the attribute's value; here, the value is a URL.

 # Get the tag attributes
print(soup.li.a['href'])  # first way
print(soup.li.a.attrs['href'])  # second way
 # Results
https://www.douban.com
https://www.douban.com
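Beyond fetching a single attribute, `.attrs` on its own returns the whole attribute dictionary, and `.get()` is a safe lookup. A small sketch on an inline snippet (html.parser used so it runs without lxml):

```python
from bs4 import BeautifulSoup

html = '<a href="https://www.douban.com" target="_blank">Douban</a>'
soup = BeautifulSoup(html, 'html.parser')

# .attrs without a key returns the whole attribute dictionary
print(soup.a.attrs)  # {'href': 'https://www.douban.com', 'target': '_blank'}

# .get() returns None instead of raising KeyError for a missing attribute
print(soup.a.get('rel'))  # None
```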

5) Get the sub-tags inside the tag

For example:

<li><a> Douban, PRC social networking website</a></li>

For example, if we are fetching the li tag, then the a tag is a sub-tag of that li tag.

 # Get the tag nested inside the tag
print(soup.li.a)
 # Results
<a data-moreurl-dict='{"from":"top-nav-click-main","uid":"0"}' href="https://www.douban.com" target="_blank"> Douban, PRC social networking website</a>

6) Get all child nodes

Child nodes: this is similar to child tags, except that here you get all the child tags under a tag, whereas the example above only fetched the first matching child tag.

 # Get child nodes
 print(soup.div.contents) # Returns a list; this is the first method
for n, tag in enumerate(soup.div.contents):
    print(n, tag)
 # Results
['
', <div class="bd">
<div class="top-nav-info">
<a class="nav-login" href="https://www.douban.com/accounts/login?source=book" rel="nofollow"> log in</a>
...
0 

1 <div class="bd">
<div class="top-nav-info">
...

This gets all the child nodes under the div; the .contents attribute is what fetches the child nodes.

7) The second method gets all the child nodes

 # The second method
 print(soup.div.children) # Returns an iterator
for n, tag in enumerate(soup.div.children):
    print(n, tag)

This one uses .children to get all the child nodes; this method returns an iterator.
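The difference between the two methods can be sketched on an inline snippet (html.parser used so it runs without lxml): `.contents` materialises the children as a list, while `.children` walks the same nodes lazily.

```python
from bs4 import BeautifulSoup

html = '<div><p>one</p><p>two</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# .contents gives a list you can index and slice...
print(soup.div.contents)        # [<p>one</p>, <p>two</p>]
# ...while .children is an iterator over the same nodes
print(list(soup.div.children))  # [<p>one</p>, <p>two</p>]
```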

8) Get the descendant nodes of the tag, that is, all descendants

Descendant nodes.

<ul>
<li>
<a> Douban, PRC social networking website</a>
</li>
</ul>

As shown above, the li tag is a child of the ul tag, and the a tag is a child of the li tag. If we are fetching the ul tag, then both the li tag and the a tag are descendant nodes of the ul tag.

 # Get the descendant nodes of the tag
 print(soup.div.descendants) # Returns a generator
for n, tag in enumerate(soup.div.descendants):
    print(n, tag)
 # Results
...
<generator object descendants at 0x00000212C1A1E308>
0 

1 <div class="bd">
<div class="top-nav-info">
<a class="nav-login" href="https://www.douban.com/accounts/login?source=book" rel="nofollow"> log in</a>
...

Here the .descendants attribute is used; it gets all the descendant nodes of the div tag, and the result is a generator.

9) Get the parent node and all ancestor nodes

Since there are child and descendant nodes, and in turn parent and ancestor nodes, it's all pretty easy to understand

 # Get the parent node
 print(soup.li.parent) # Return the entire parent node
 # Get ancestor nodes
 print(soup.li.parents) # Returns a generator
for n, tag in enumerate(soup.li.parents):
    print(n, tag)

The .parent attribute gets the parent node, returning the entire parent node that contains the child. The .parents attribute gets all the ancestor nodes, and what it returns is a generator.

10) Get sibling nodes

Sibling nodes.

<ul>
<li>
<a> Douban, PRC social networking website1</a>
</li>
<li>
<a> Douban, PRC social networking website2</a>
</li>
<li>
<a> Douban, PRC social networking website3</a>
</li>
</ul>

For example, in the HTML code above, the li tags are all child nodes of the ul tag, and the li tags are at the same level, so each li tag is a sibling of the others. These are sibling nodes.

 # Get sibling nodes
 print(soup.li.next_siblings) # Gets all the sibling nodes after this tag, excluding itself; returns a generator
for x in soup.li.next_siblings:
    print(x)
 # Results
<generator object next_siblings at 0x000002A04501F308>
<li class="on">
<a data-moreurl-dict='{"from":"top-nav-click-book","uid":"0"}' href="https://book.douban.com"> read a book</a>
</li>
...

The .next_siblings attribute gets all the sibling nodes that follow the tag, excluding the tag itself. The return value is again a generator.

Similarly, since there is a way to get all the siblings after a tag, there is also one to get all the siblings before it:

soup.li.previous_siblings

If you only need a single sibling, drop the trailing s from the attribute names above, as follows:

soup.li.previous_sibling # Get the previous sibling node
 soup.li.next_sibling # Get the next sibling node
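One gotcha worth knowing, sketched on an inline snippet (html.parser used so it runs without lxml): in pretty-printed HTML the node right next to a tag is often a whitespace text node, so .next_sibling may not be the tag you expect.

```python
from bs4 import BeautifulSoup

# With no whitespace between tags, .next_sibling is the next li...
soup = BeautifulSoup('<ul><li>one</li><li>two</li></ul>', 'html.parser')
print(soup.li.next_sibling)  # <li>two</li>

# ...but with a newline between them, .next_sibling is the text node '\n'
soup = BeautifulSoup('<ul><li>one</li>\n<li>two</li></ul>', 'html.parser')
print(repr(soup.li.next_sibling))  # '\n'
```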

3.More advanced uses of the bs library

So far we can get a tag's name, attributes, content and all its descendant tags. But when we need to fetch tags by an arbitrary specified attribute, things get a little harder. For that, there is the following method:

soup.find_all( name , attrs , recursive , text , **kwargs )
  • name: the name of the tag to fetch
  • attrs: receives a dictionary of attribute key/value pairs, or you can simply use keyword arguments instead, as shown below
  • recursive: sets whether to search only direct child nodes
  • text: the corresponding string content
  • limit: limits the number of results
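To make those parameters concrete, here is a small sketch on an inline snippet (html.parser used so it runs without lxml). Note that recursive=False only searches the direct children of the node you call it on:

```python
from bs4 import BeautifulSoup

html = '<ul><li>a</li><li>b</li><li>c</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.find_all('li', limit=2))          # only the first two li tags
print(soup.find_all('li', recursive=False))  # [] -- li is not a direct child of the document
print(soup.find_all(text='b'))               # ['b'] -- matches the string content
```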

1) First use the name parameter to search

 # Search by the name parameter first
print(soup.find_all('li'))  # Returns a list of all li tags
 # Results
[<li class="">
<a data-moreurl-dict='{"from":"top-nav-click-main","uid":"0"}' href="https://www.douban.com" target="_blank"> Douban, PRC social networking website</a>
</li>, <li class="on">
...

Here we get all the tags whose name is li.

2) Use name and attrs parameters

 # Use the name and attrs parameters
print(soup.find_all('div', {'class': 'more-meta'}))  # This also filters by attribute; the attrs argument is a dictionary
 # Results
[<div class="more-meta">
<h4 class="title">
                   stab
                </h4>
...

Here we searched for div tags whose class attribute is 'more-meta'.

3) Search by keyword parameters

 # The same can be done with keyword arguments for the relevant attributes
print(soup.find_all(class_='more-meta'))  # Uses a keyword argument; since class is a Python keyword, an extra underscore is added to distinguish it
 # Results
 Same result as above
...

Note that here we look for tags whose class attribute is more-meta using a keyword argument. But class is a keyword in Python, so to avoid a syntax error you need to append an underscore to class.

The other parameters won't be introduced here; you can check the official documentation yourself.

4) find() method

This method is similar to the find_all() method, except that it only finds the first matching tag, whereas the latter finds all matching tags.

There are many similar methods with similar usage, so I won't demonstrate them one by one; check the official documentation if you need them.
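The find()/find_all() contrast can be sketched on an inline snippet (html.parser used so it runs without lxml); note that find() returns None when nothing matches, so guard before chaining attribute access:

```python
from bs4 import BeautifulSoup

html = ('<div class="more-meta"><h4>one</h4></div>'
        '<div class="more-meta"><h4>two</h4></div>')
soup = BeautifulSoup(html, 'html.parser')

# find() returns only the first match; find_all() returns every match
print(soup.find('div', class_='more-meta').h4.string)  # one
print(len(soup.find_all('div', class_='more-meta')))   # 2

# find() returns None when nothing matches
print(soup.find('span'))  # None
```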

5) select() method

This method uses a css selector to filter the tags.

css selector: is to select a tag based on its name, id and class attributes.

  • By tag name: write the tag names directly, e.g. li a, which finds the a tags under li tags
  • By class attribute: use the . symbol plus the class value, e.g. .title .time, which finds tags with class time under tags with class title
  • By id attribute: use # plus the id value, e.g. #img #width, which finds tags with id width under tags with id img
  • The three above can be mixed, e.g. ul .title #width

If you're not too comfortable with selectors yet, just press F12 in your browser and take a look.

The position the arrow points to (in the original screenshot) is the selector expression.

The code is as follows

 # You can also filter elements with the tag selector, which returns a list
print(soup.select('ul li div'))  # Filter by tag name
print(soup.select('.info .title'))  # Filter by class
print(soup.select('#footer #icp'))  # Filter by id
 # The above can be mixed and matched
print(soup.select('ul li .cover a img'))

Here is how to get attributes and text content:

 # Get properties
for attr in soup.select('ul li .cover a img'):
    # print(attr.attrs['alt'])
     # You can do the same
    print(attr['alt'])

 # Get the tag's text content
for tag in soup.select('li'):
     print(tag.get_text()) # Sub-tags are allowed; their contents are output along with the rest

The .get_text() method is a bit different from the earlier .string property: it gets all the text content of the tag, whether or not it has sub-tags.
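The contrast is easy to see on an inline snippet (html.parser used so it runs without lxml):

```python
from bs4 import BeautifulSoup

html = '<li>rating: <a>9.0</a> points</li>'
soup = BeautifulSoup(html, 'html.parser')

# .string is None because the li has mixed children (text plus an a tag)
print(soup.li.string)      # None
# .get_text() concatenates every piece of text inside the tag
print(soup.li.get_text())  # rating: 9.0 points
```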

Write at the end

All of the above are personal notes made during study. They're still somewhat lacking, so feel free to point out any mistakes. If you want to see more related usage, check the official documentation: http://beautifulsoup.readthedocs.io/zh_CN/latest/

Study reference: https://edu.hellobi.com/course/157

If this article was useful to you, how about a like and a retweet?

Also, I wish you all a happy April Fool's Day today

