Python crawler common library: BeautifulSoup in detail
This is the 16th original article in the series on learning Python day by day.
After the previous few articles, you probably already know how to crawl quite a few small and medium-sized websites. But some readers say the earlier material on regular expressions is too hard to learn. Regular expressions are indeed difficult; as the saying goes, if you solve a problem with a regular expression, you now have two problems. So don't worry if you haven't mastered them: besides regular expressions, there is another powerful library we can use to parse HTML. Today's topic is that powerful library, BeautifulSoup. That said, regular expressions still deserve plenty of practice.
Since BeautifulSoup is a third-party library, we need to install it by running the following at the command line:

```shell
pip install beautifulsoup4
```
Install the third-party parsing libraries as well:

```shell
pip install lxml
pip install html5lib
```
If you don't know what these are for, read on.
1. Introduction to related parsing libraries
The officially recommended parser is lxml, because it is efficient. All the examples below use the lxml parser.
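As a minimal sketch of what "choosing a parser" means (this snippet uses the built-in html.parser so it runs even without the third-party installs; the API is identical with 'lxml'):

```python
from bs4 import BeautifulSoup

html = "<html><body><p>hello</p></body></html>"

# The second argument selects the parser: 'html.parser' ships with Python,
# while 'lxml' and 'html5lib' are the third-party parsers installed above.
# Everything else about the API stays the same whichever one you pick.
soup = BeautifulSoup(html, "html.parser")
print(soup.p.string)  # -> hello
```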
2. Detailed syntax introduction
This article parses the home page of Douban Books: https://book.douban.com/
1) Create bs objects
```python
from bs4 import BeautifulSoup
import requests

response = requests.get('https://book.douban.com/').text
# print(response)

# Create the bs object
soup = BeautifulSoup(response, 'lxml')  # use the lxml parser
```
2) Get relevant tags
A tag:

```html
<a data-moreurl-dict='{"from":"top-nav-click-main","uid":"0"}' href="https://www.douban.com" target="_blank"> Douban, PRC social networking website</a>
```
In the snippet above, a is the tag name. Put simply, the first word inside the <> brackets is the tag name.
```python
# Get a tag
print(soup.li)  # this gets only the first li tag
```

Output:

```
<li class="">
<a data-moreurl-dict='{"from":"top-nav-click-main","uid":"0"}' href="https://www.douban.com" target="_blank"> Douban, PRC social networking website</a>
</li>
```
3) Get the name and content of the tag
Tag name and content:

```html
<a> Douban, PRC social networking website</a>
```
As mentioned above, a is the tag name, and the text between the opening and closing tags is the tag's content. In this example, "Douban, PRC social networking website" is the content of that tag.
```python
# Get the tag name
print(soup.li.name)

# Get the tag's text
print(soup.li.string)  # works only if the tag has no child tags, otherwise returns None
```

Output:

```
li
None
```
Since this li tag has a child tag inside, its .string is None. Here is how to get its text content:
```python
# Get the tag inside the tag
print(soup.li.a)
print(soup.li.a.string)  # this tag has no child tags, so its content can be fetched
```

Output:

```
<a data-moreurl-dict='{"from":"top-nav-click-main","uid":"0"}' href="https://www.douban.com" target="_blank"> Douban, PRC social networking website</a>
 Douban, PRC social networking website
```
4) Get the tag attributes; there are two ways
Tag attributes:

```html
<a href="https://www.douban.com" target="_blank"> Douban, PRC social networking website</a>
```
Put simply, attributes appear after the tag name but still inside the <> brackets, and each is written with an equals sign. In the snippet above, href is the attribute name, and the value to the right of the equals sign is the attribute's value, here a URL.
```python
# Get a tag attribute
print(soup.li.a['href'])        # first way
print(soup.li.a.attrs['href'])  # second way
```

Output:

```
https://www.douban.com
https://www.douban.com
```
5) Get the sub-tags inside the tag
Example tags:

```html
<li><a> Douban, PRC social networking website</a></li>
```
For example, if we fetch the li tag above, then the a tag is a child tag of that li tag.
```python
# Get the tag inside the tag
print(soup.li.a)
```

Output:

```
<a data-moreurl-dict='{"from":"top-nav-click-main","uid":"0"}' href="https://www.douban.com" target="_blank"> Douban, PRC social networking website</a>
```
6) Get all child nodes
Child nodes are similar to child tags, except that here you get all the direct children under a tag; the previous example only fetched the first matching child tag.
```python
# Get the child nodes (first way)
print(soup.div.contents)  # returns a list

for n, tag in enumerate(soup.div.contents):
    print(n, tag)
```

Output (truncated):

```
[' ', <div class="bd"> <div class="top-nav-info"> <a class="nav-login" href="https://www.douban.com/accounts/login?source=book" rel="nofollow"> log in</a> ...]
0
1 <div class="bd"> <div class="top-nav-info"> ...
```
This gets all the child nodes under the div; the .contents attribute is what returns the child nodes.
7) The second method gets all the child nodes
```python
# The second way
print(soup.div.children)  # returns an iterator

for n, tag in enumerate(soup.div.children):
    print(n, tag)
```
This uses the .children attribute to get all the child nodes; it returns an iterator.
8) Get the descendant nodes of the label, that is, all descendants
Descendant nodes:

```html
<ul>
  <li>
    <a> Douban, PRC social networking website</a>
  </li>
</ul>
```
As you can see above, the li tag is a child of the ul tag, and the a tag is a child of the li tag. If we fetch the ul tag, then both the li tag and the a tag are descendant nodes of the ul tag.
```python
# Get the descendant nodes of a tag
print(soup.div.descendants)  # returns a generator

for n, tag in enumerate(soup.div.descendants):
    print(n, tag)
```

Output (truncated):

```
<generator object descendants at 0x00000212C1A1E308>
0
1 <div class="bd"> <div class="top-nav-info"> <a class="nav-login" href="https://www.douban.com/accounts/login?source=book" rel="nofollow"> log in</a>
...
```
Here the .descendants attribute gets all the descendant nodes of the div tag; the result is a generator.
9) Get the parent node and all ancestor nodes
Since there are child and descendant nodes, there are naturally parent and ancestor nodes as well; they are all easy to understand.
```python
# Get the parent node
print(soup.li.parent)  # returns the entire parent node

# Get the ancestor nodes
print(soup.li.parents)  # returns a generator

for n, tag in enumerate(soup.li.parents):
    print(n, tag)
```
The .parent attribute gets the parent node and returns the entire parent node containing that child. The .parents attribute gets all ancestor nodes and returns a generator.
10) Get sibling nodes
Sibling nodes:

```html
<ul>
  <li><a> Douban, PRC social networking website1</a></li>
  <li><a> Douban, PRC social networking website2</a></li>
  <li><a> Douban, PRC social networking website3</a></li>
</ul>
```
In the HTML above, each li tag is a child node of the ul tag, and the li tags all sit at the same level, so they are each other's siblings. These are sibling nodes.
```python
# Get the sibling nodes
print(soup.li.next_siblings)  # all following siblings of this tag, excluding itself; returns a generator

for x in soup.li.next_siblings:
    print(x)
```

Output (truncated):

```
<generator object next_siblings at 0x000002A04501F308>
<li class="on">
<a data-moreurl-dict='{"from":"top-nav-click-book","uid":"0"}' href="https://book.douban.com"> read a book</a>
</li>
...
```
The .next_siblings attribute gets all the sibling nodes that follow the tag, excluding the tag itself. The result is a generator.
Similarly, since there is a way to get all the following siblings, there is also one for all the preceding siblings:
```python
soup.li.previous_siblings
```
If you only need a single sibling, drop the trailing s from the attributes above, like this:
```python
soup.li.previous_sibling  # get the previous sibling node
soup.li.next_sibling      # get the next sibling node
```
3. More advanced uses of the bs library
So far we can get a tag's name, attributes, content, and all of its descendant tags. But when we need to get a tag with a specific attribute, it is still a little awkward. For that, there is the following method:
```python
soup.find_all(name, attrs, recursive, text, **kwargs)
```
1) First use the name parameter to search
```python
# Use the name parameter first
print(soup.find_all('li'))  # returns a list of all li tags
```

Output (truncated):

```
[<li class="">
<a data-moreurl-dict='{"from":"top-nav-click-main","uid":"0"}' href="https://www.douban.com" target="_blank"> Douban, PRC social networking website</a>
</li>, <li class="on">
...
```
Here we get all tags whose name is li.
2) Use name and attrs parameters
```python
# Use the name and attrs parameters
print(soup.find_all('div', {'class': 'more-meta'}))  # filters further; the attrs argument is a dictionary
```

Output (truncated):

```
[<div class="more-meta">
<h4 class="title"> stab </h4>
...
```
Here we search for div tags with the attribute class='more-meta'.
3) Search by keyword parameters
```python
# You can also filter on attributes directly with keyword arguments
print(soup.find_all(class_='more-meta'))  # class is a Python keyword, so the parameter takes a trailing underscore
```

Output: same result as above.
Note that here we look for tags whose class attribute is more-meta using a keyword argument. Since class is a reserved keyword in Python, you need to write class_ with a trailing underscore to avoid a syntax error.
The other parameters are not covered here; see the official documentation for details.
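For completeness, here is a small sketch of two of those skipped parameters, recursive and string (the latter is the newer name for the text parameter in the signature above), run on a hand-written snippet rather than the Douban page:

```python
from bs4 import BeautifulSoup

html = """
<div id="outer">
  <p>top-level paragraph</p>
  <span><p>nested paragraph</p></span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# recursive=False searches only direct children, not all descendants
div = soup.find('div')
print(len(div.find_all('p')))                   # 2: searches all descendants
print(len(div.find_all('p', recursive=False)))  # 1: only direct children

# string= matches on text content instead of tag names
print(soup.find_all(string='top-level paragraph'))
```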
4) find() method
This method is similar to find_all(), except that find() returns only the first matching tag, while find_all() returns all matching tags.
There are many similar methods with similar usage, so I won't demonstrate them one by one; check the official documentation if you need them.
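As a quick illustration of the find()/find_all() difference, on a tiny hand-written snippet:

```python
from bs4 import BeautifulSoup

html = "<ul><li>one</li><li>two</li></ul>"
soup = BeautifulSoup(html, "html.parser")

print(soup.find('li'))      # first match only: <li>one</li>
print(soup.find_all('li'))  # every match: [<li>one</li>, <li>two</li>]
print(soup.find('table'))   # no match: find() returns None, find_all() returns []
```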
5) select() method
This method uses a css selector to filter the tags.
A CSS selector selects tags based on their name, id, and class attributes.
If you're not familiar with CSS selectors yet, press F12 in your browser to open the developer tools and inspect an element; the element panel shows the selector expression. The code is as follows:
```python
# You can also filter elements with CSS selectors; select() returns a list
print(soup.select('ul li div'))     # filter by tag name
print(soup.select('.info .title'))  # filter by class
print(soup.select('#footer #icp'))  # filter by id

# The above can be mixed and matched
print(soup.select('ul li .cover a img'))
```
Here is how to get attributes and text content:
```python
# Get attributes
for attr in soup.select('ul li .cover a img'):
    # print(attr.attrs['alt'])  # this works too
    print(attr['alt'])

# Get the tag's text
for tag in soup.select('li'):
    print(tag.get_text())  # the tag may contain child tags; their text is output as well
```
The .get_text() method differs from the earlier .string attribute: it gets all the text content of a tag, whether or not the tag has child tags.
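A tiny sketch of the contrast (hand-written snippet, using the built-in html.parser):

```python
from bs4 import BeautifulSoup

html = "<li><a>Douban</a> books</li>"
soup = BeautifulSoup(html, "html.parser")

print(soup.li.string)      # None: the li has more than one child, so .string gives up
print(soup.li.get_text())  # 'Douban books': all text, child tags included
```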
Written at the end
All of the above are personal notes made during study. They are still somewhat lacking, so feel free to point out any mistakes haha. If you want to see more related usage, check the official documentation: http://beautifulsoup.readthedocs.io/zh_CN/latest/
Study reference: https://edu.hellobi.com/course/157
If this article was useful to you, how about a like and a retweet?
Also, I wish you all a happy April Fool's Day today