Python crawler knowledge point 4 - the common parsing library re (regular expressions)


There are tangerines in the south of the river, and they are still green through the winter.

Will it be warmed by the ground? Since there are years of cold hearts.

I could recommend the guest of honor, but the obstacles are too deep.

Fate is only what it meets, and the cycle is not to be found.

In vain is the tree peach and plum, but is there no shade in this wood?

Today the Inspector discusses a common parsing library, Python's re regular expression library, from three angles: what a regular expression is, regular expression syntax, and the re parsing functions.

Part 1 - What is re?

re is short for regular expression, also written regex or regexp. A regular expression is a concept from computer science: a logical formula for working with strings and text. It combines a predefined set of special characters into a "rule string" that expresses the filtering logic to apply to the text, so that we get exactly the data we want.

Regular expressions are a general-purpose rule notation. For ease of use, Python ships a built-in re module, which we load with import re.

Part 2 - Regular expression syntax

The Inspector divides regular expression syntax into two main parts: character expression syntax, which declares which characters or symbols to match, and quantity expression syntax, which declares how many of those characters or symbols to match.

2.1 Character expression syntax
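
As a minimal sketch of character expression syntax, the snippet below exercises a few of the most common tokens. The sample strings and variable names are made up for illustration.

```python
import re

# \d matches one digit, \w one "word" character (letter, digit, underscore),
# \s one whitespace character, and . any single character except a newline.
digits = re.findall(r"\d", "A-17")                 # ['1', '7']
word_chars = re.findall(r"\w", "a_1 !")            # ['a', '_', '1']
spaces = re.findall(r"\s", "a b\tc")               # [' ', '\t']
any_char = re.findall(r"a.c", "abc a-c a\nc")      # ['abc', 'a-c']

# [...] lists the allowed characters; [^...] lists the forbidden ones.
vowels = re.findall(r"[aeiou]", "regular")         # ['e', 'u', 'a']
non_digits = re.findall(r"[^0-9]", "4a5b")         # ['a', 'b']

print(digits, word_chars, spaces, any_char, vowels, non_digits)
```

Writing the patterns as raw strings (the r prefix) keeps Python from interpreting the backslashes before the re module sees them.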

2.2 Quantity expression syntax
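
The quantity symbols can be sketched the same way. The example below is illustrative only, with a made-up test string, and shows how each quantifier changes what is matched.

```python
import re

text = "a ab abb abbb"

# * means zero or more, + one or more, ? zero or one, {m,n} between m and n.
star = re.findall(r"ab*", text)          # ['a', 'ab', 'abb', 'abbb']
plus = re.findall(r"ab+", text)          # ['ab', 'abb', 'abbb']
optional = re.findall(r"ab?", text)      # ['a', 'ab', 'ab', 'ab']
counted = re.findall(r"ab{2,3}", text)   # ['abb', 'abbb']

# Quantifiers are greedy by default; a trailing ? makes them lazy.
greedy = re.search(r"<.+>", "<b>bold</b>").group()    # '<b>bold</b>'
lazy = re.search(r"<.+?>", "<b>bold</b>").group()     # '<b>'

print(star, plus, optional, counted, greedy, lazy)
```

The lazy form .*? is the one used throughout the crawler examples, because it stops at the first place where the rest of the pattern can match.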

2.3 Other expressions
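
A few other frequently used constructs are anchors, alternation, grouping, and escaping. The following is a small sketch under the assumption that these are the constructs of interest here; the inputs are invented.

```python
import re

# ^ and $ anchor the pattern to the start and end of the string.
is_number = re.search(r"^\d+$", "12345") is not None     # True
not_number = re.search(r"^\d+$", "123a5") is not None    # False

# | chooses between alternatives; ( ) groups and captures part of a match.
pets = re.findall(r"cat|dog", "cat, dog, bird")          # ['cat', 'dog']
date = re.search(r"(\d{4})-(\d{2})-(\d{2})", "released 2018-06-01")
year, month, day = date.groups()                         # '2018', '06', '01'

# A backslash escapes a metacharacter so that it is matched literally.
dots = re.findall(r"\.", "a.b.c")                        # ['.', '.']

print(is_number, not_number, pets, year, month, day, dots)
```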

Part 3 - Regular Parsing Functions

3.1 re.match()

re.match() is the basic matching function: re.match(pattern, string, flags=0), where pattern is the matching rule, string is the text to search, and flags is an optional matching mode (for example, re.S lets . (dot) match any character, including a line break).

re.match() only matches at the start of the string, never from the middle, and it returns only the first match of the rule, so it has some limitations.

import re
contents = 'hello, my phone number is 123456789.'
# re.match matches numbers
# Wrong expression - the string does not start with a digit, so there is no match
result = re.match(r'(\d+)', contents)
print(result)  # None; calling result.group(1) here would raise AttributeError
# The right way to express it
result = re.match(r'hello.*?(\d+)', contents)
print(result.group(1))  # 123456789

3.2 re.search()

re.search() is also a matching function: re.search(pattern, string, flags=0), where pattern is the matching rule, string is the text to search, and flags is an optional matching mode (for example, re.S lets . (dot) match any character, including a line break).

re.search() makes up for the fact that match only works at the start of the string. It looks for a match anywhere in the text, but again it returns only the first match of the rule.

# re.search()
# re.search() does not have to match from the start of the string
result = re.search(r'(\d+)', contents)
print(result.group(1))  # 123456789

Note: re.match() and re.search() return only the first match and cannot collect all the matching data. To get every match, use the findall method instead.
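
To make the difference concrete, here is a minimal sketch that runs the three functions against the same string. The string and variable names are made up for illustration.

```python
import re

text = "room 101, room 202, room 303"
pattern = r"\d+"

at_start = re.match(pattern, text)           # None: the text does not start with a digit
first = re.search(pattern, text).group()     # '101': first match anywhere in the text
every = re.findall(pattern, text)            # ['101', '202', '303']: every match

print(at_start, first, every)
```

Because match and search return None when nothing is found, it is safer to check the result before calling group() on it.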

3.3 re.findall()

re.findall() returns all the results that match the rule: re.findall(pattern, string, flags=0), where pattern is the matching rule, string is the text to search, and flags is an optional matching mode (for example, re.S lets . (dot) match any character, including a line break).

re.findall() makes up for the limitation of match and search by returning every result that matches the rule.

# re.findall()
contents = 'hello 001, I am 002'
result = re.findall(r'(\d+)', contents)
print(result)
# Returns the desired result as a list: ['001', '002']

3.4 re.sub()

re.sub() is the replacement function: re.sub(pattern, repl, string), where pattern is the rule that finds the text to be replaced, repl is the replacement text, and string is the original text.

# re.sub()
contents = 'hello 001, I am Data Pawn'
result = re.sub(r'\d+', 'Inspector Wong Elephant.', contents)
print(result)
# The result returned is 'hello Inspector Wong Elephant., I am Data Pawn'
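
re.sub() also accepts an optional count argument, and repl may refer back to captured groups. The sketch below shows both on invented phone numbers; it is an illustration, not part of the original example.

```python
import re

text = "call 13800001111 or 13900002222"

# count=1 limits the replacement to the first match only.
first_only = re.sub(r"\d+", "<number>", text, count=1)
# 'call <number> or 13900002222'

# \1 and \2 in repl refer to the captured groups, so parts of the match can be kept.
masked = re.sub(r"(\d{3})\d{4}(\d{4})", r"\1****\2", text)
# 'call 138****1111 or 139****2222'

print(first_only)
print(masked)
```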

3.5 re.compile()

re.compile() compiles a matching rule into a pattern object that can be reused conveniently: re.compile(pattern, flags=0), where pattern is the matching rule and flags is an optional matching mode.

# re.compile()
contents = 'hello 001, I am 002'
patterns = re.compile(r'\d+')
result = re.findall(patterns, contents)
print(result)
# The result is again returned as a list: ['001', '002']
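
A compiled pattern object also exposes the matching functions as methods, which is where reuse pays off. A minimal sketch, with the variable name number chosen only for illustration:

```python
import re

number = re.compile(r"\d+")
contents = "hello 001, I am 002"

all_numbers = number.findall(contents)           # ['001', '002']
first_number = number.search(contents).group()   # '001'
replaced = number.sub("#", contents)             # 'hello #, I am #'

print(all_numbers, first_number, replaced)
```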

3.6 Matching flags - re.S

# Matching flags
# Wrong expression - not using the re.S flag
contents = '''hello my phone number is 1234
5678, please call me'''
result = re.search('is (.*?),', contents)
print(result)
# None: without re.S, . cannot cross the line break, so result.group(1) would raise an exception
# The right way to express it
result = re.search('is (.*?),', contents, re.S)
print(result.group(1))
# Returns the correct match '1234\n5678', printed on two lines
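
The same issue appears constantly when parsing web pages, because HTML is usually spread over several lines. The sketch below uses a made-up HTML fragment (not taken from any real page) to show the effect of re.S, whose long name is re.DOTALL.

```python
import re

# Invented markup for illustration: the text sits on its own line inside the tag.
html = """<p class="name">
    Sample Movie
</p>"""

without_flag = re.search(r'<p class="name">(.*?)</p>', html)
with_flag = re.search(r'<p class="name">(.*?)</p>', html, re.S)

print(without_flag)                  # None: . stops at the line break
print(with_flag.group(1).strip())    # Sample Movie
```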

Part 4 - Small Cases

# Small case - Maoyan (Cat's Eye) top movies
import re
import requests

url = 'http://maoyan.com/board'
headers = {'User-Agent': 'Mozilla/5.0 (Windows ' +
           'NT 6.3; Win64; x64) AppleWebKit/' +
           '537.36 (KHTML, like Gecko) Chrome/' +
           '66.0.3359.181 Safari/537.36'}

# Define the web request and get the page content
def get_html(url):
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    html = response.text
    return html

# Define the parsing process to get the desired data
def parse_html(content):
    # Assumed page markup: <i>, <a> and <p> tags around each field; adjust if the live page differs
    patterns = re.compile(r'class="board-index.*?>(.*?)</i>' +
                          r'.*?<p class="name">.*?data-val.*?>(.*?)</a>' +
                          r'.*?<p class="star">(.*?)</p>.*?<p class=' +
                          r'"releasetime">(.*?)</p>.*?class="integer">' +
                          r'(.*?)</i>.*?class="fraction">(.*?)</i>', re.S)
    result = re.findall(patterns, content)
    for item in result:
        yield {
            'rank': item[0],
            'name': item[1],
            # Keep the text after the page's "主演：" (starring) label
            'actor': item[2].strip().split('主演：')[-1],
            # Keep the text after the full-width colon of the release date label
            'releasetime': item[3].strip().split('：')[-1],
            'score': item[4] + item[5]
        }

if __name__ == '__main__':
    html = get_html(url)
    result = parse_html(html)
    for item in result:
        print(item)
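
Since the live page may change or block requests, the parsing step can also be tried offline. The following is a minimal sketch under stated assumptions: SAMPLE_HTML is made-up markup in the general shape such a pattern expects, the English labels "Starring:" and "Release date:" are placeholders, and the names SAMPLE_HTML and PATTERN are hypothetical.

```python
import re

# Made-up markup for illustration; it is not copied from the live site.
SAMPLE_HTML = """
<dd>
  <i class="board-index board-index-1">1</i>
  <p class="name"><a href="/films/1" data-val="{movieId:1}">Sample Movie</a></p>
  <p class="star">
      Starring: Actor A, Actor B
  </p>
  <p class="releasetime">Release date: 2018-06-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
"""

# One capture group per field; re.S lets .*? skip over line breaks between tags.
PATTERN = re.compile(
    r'class="board-index.*?>(.*?)</i>'
    r'.*?data-val.*?>(.*?)</a>'
    r'.*?<p class="star">(.*?)</p>'
    r'.*?"releasetime">(.*?)</p>'
    r'.*?class="integer">(.*?)</i>'
    r'.*?class="fraction">(.*?)</i>',
    re.S,
)


def parse_html(content):
    """Yield one dict per movie entry found in content."""
    for rank, name, star, released, integer, fraction in PATTERN.findall(content):
        yield {
            "rank": rank,
            "name": name,
            # Drop the label in front of the first ": " and keep the value.
            "actor": star.strip().split(": ", 1)[-1],
            "releasetime": released.strip().split(": ", 1)[-1],
            "score": integer + fraction,
        }


movies = list(parse_html(SAMPLE_HTML))
print(movies)
```

Testing the pattern on a saved or hand-written fragment first makes it easier to see which group fails when the page layout changes.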

Running the above program prints the crawled data, one dictionary per movie.

That covers the use of the common parsing library re, and the Inspector hopes it helps you! If you have any questions, feel free to contact the Inspector or leave a message.

I'm looking forward to meeting you in a sea of people, in the most beautiful moments for you and me.

