[python-regex] 1.1 Learning Python Regular Expressions, Part 1

fishni

python / Python Basics

Published: April 13, 2020

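This article covers: an overview of regular expressions, using the re module, matching characters, raw strings, quantifiers, boundaries, grouping, advanced use of the re module, greedy and non-greedy matching, and short exercises.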

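0x01 Overview of regular expressions

A regular expression (Regular Expression, often abbreviated in code as regex, regexp, or RE) is a concept from computer science. It uses a single string to describe and match a set of strings that follow some syntactic rule. Many text editors use regular expressions to search for and replace text that matches a pattern.

The "Regular" in Regular Expression is usually translated into Chinese as 正则, 正规, or 常规. Here "regular" means "rule" or "pattern", so a regular expression is "an expression that describes some rule".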

0x02 Working with the re module

In Python, matching strings against regular expressions is done with the re module.

2.1 Using the re module

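# Import the re module
import re

# Use match to try the pattern against the string
# (pattern and string are placeholders for your own values)
result = re.match(pattern, string)

# If the previous step found a match, use group() to extract the matched text
result.group()

re.match checks whether the regular expression matches at the beginning of the string. If it does, match returns a Match object; otherwise it returns None (note: not the empty string "").
A Match object has a group method, which returns the part of the string that matched.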
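
Because re.match returns None when nothing matches, calling group() directly on the result can raise AttributeError. A minimal sketch of a None-safe wrapper follows; the helper name first_match is hypothetical and not part of the re module.

```python
import re

def first_match(pattern, text):
    """Return the text matched at the start of `text`, or None if nothing matches.

    `first_match` is a hypothetical helper for illustration, not part of `re`.
    """
    m = re.match(pattern, text)
    return m.group() if m is not None else None

print(first_match("hello", "hello world"))  # hello
print(first_match("hello", "say hello"))    # None: match only looks at the start
```

Checking for None before calling group() avoids the AttributeError that otherwise appears on a failed match.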

2.2 A re module example

# Match a string that starts with hello
import re
result = re.match("hello", "hello world")
# Inspect the Match object
result

Out:<re.Match object; span=(0, 5), match='hello'>

result.group()

Out:'hello'

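0x03 Matching characters

Common patterns for matching a single character

Pattern	Meaning
.	Matches any single character (except \n)
[]	Matches any one of the characters listed inside []
\d	Matches a digit, i.e. 0-9
\D	Matches a non-digit
\s	Matches whitespace, e.g. space, tab, newline
\S	Matches non-whitespace
\w	Matches a word character, i.e. a-z, A-Z, 0-9, _
\W	Matches a non-word character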
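
The table can be checked quickly with a small script. This is an illustrative sketch: the sample strings are arbitrary choices, and re.fullmatch is used so that each pattern must account for the whole one-character sample.

```python
import re

# (pattern, sample that should match, sample that should not match)
cases = [
    (r".", "a", "\n"),
    (r"[hH]", "H", "x"),
    (r"\d", "7", "x"),
    (r"\D", "x", "7"),
    (r"\s", "\t", "x"),
    (r"\S", "x", " "),
    (r"\w", "_", "@"),
    (r"\W", "@", "_"),
]

results = []
for pattern, good, bad in cases:
    # A row behaves as described if it accepts `good` and rejects `bad`
    ok = re.fullmatch(pattern, good) is not None and re.fullmatch(pattern, bad) is None
    results.append(ok)
    print(f"{pattern!r}: {'as expected' if ok else 'unexpected'}")
```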

3.1 Example 1: .

import re
ret1 = re.match(".", "a")
ret1.group()

Out:'a'

ret2 = re.match(".", "bca")
ret2.group()

Out:'b'

ret3 = re.match(".", "@18")
ret3.group()

Out:'@'

3.2 Example 2: []

# If hello starts with a lowercase letter, the pattern needs a lowercase h
ret1 = re.match("h", "hello Python")
ret1.group()

Out:'h'

# If Hello starts with an uppercase letter, the pattern needs an uppercase H
ret2 = re.match("H", "Hello Python")
ret2.group()

Out:'H'

# Either case is accepted
ret3 = re.match("[hH]", "hello Python")
ret3.group()

Out:'h'

ret4 = re.match("[hH]", "Hello Python")
ret4.group()

Out:'H'

# Matching 0-9, first form
ret5 = re.match("[0123456789]", "7Hello")
ret5.group()

Out:'7'

# Matching 0-9, second form
ret6 = re.match("[0-9]", "7Hello")
ret6.group()

Out:'7'

3.3 Example 3: \d

# A plain match
ret1 = re.match("hello2", "hello2world")
ret1.group()

Out:'hello2'

# Matching with \d
ret2 = re.match(r"hello\d", "hello2world")
ret2.group()

Out:'hello2'

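0x04 Raw strings

4.1 Examples

# We want to print \n literally, but with a single backslash it is treated as a newline
s = "\ndd"
print(s)

Prints: an empty line followed by dd

# Doubling the backslash gives a literal backslash followed by n
s = "\\ndd"
print(s)

Prints: \ndd

Next, store the path c:\a\b\c in mm. In a normal string literal each backslash has to be doubled, because \a and \b are themselves escape sequences:

mm = "c:\\a\\b\\c"
mm

Out:'c:\\a\\b\\c'

print(mm)

Prints: c:\a\b\c

# In a normal string literal, the pattern needs four backslashes to match one literal backslash in mm (the repr below shows it as '\\')
re.match("c:\\\\", mm).group()

Out:'c:\\'

ret = re.match("c:\\\\a", mm).group()
print(ret)

Prints: c:\a

# With the r prefix, two backslashes are enough
ret = re.match(r"c:\\a", mm).group()
print(ret)

Prints: c:\a

A single backslash in the raw pattern is not enough: the regex engine reads \a as an escape rather than as a backslash followed by a, so nothing matches:

ret = re.match(r"c:\a", mm).group()
print(ret)

Output:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-45-1a051863b698> in <module>
----> 1 ret = re.match(r"c:\a", mm).group()
      2 print(ret)

AttributeError: 'NoneType' object has no attribute 'group'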

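4.2 Explanation

Prefixing a Python string literal with r makes it a raw string.

Like most programming languages, regular expressions use "\" as the escape character, which can lead to a pile-up of backslashes. To match a literal "\" in the text, a regular expression written in a normal string literal needs four backslashes, "\\\\": the language first turns each pair into a single backslash, leaving two backslashes, which the regex engine then reads as one literal backslash.

Python's raw strings solve this neatly: there is no need to worry about a missing backslash, and the expression is easier to read.

mm = "c:\\a\\b\\c"
ret = re.match(r"c:\\a", mm).group()
print(ret)

Prints: c:\a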
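
The two layers of escaping can be made visible by comparing the pattern strings directly. The sketch below is illustrative only; it also shows re.escape, a standard-library function that builds a pattern matching a given text literally, as one way to avoid counting backslashes by hand.

```python
import re

path = "c:\\a\\b\\c"          # the text c:\a\b\c

# The same regex written two ways: both hold the characters c:\\a
plain = "c:\\\\a"
raw = r"c:\\a"
print(plain == raw)           # True

# re.escape builds a pattern that matches the given text literally
escaped = re.escape("c:\\a")
print(re.match(raw, path).group() == re.match(escaped, path).group())  # True
```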

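0x05 Quantifiers

Patterns for matching a character multiple times

Pattern	Meaning
*	Matches the preceding character zero or more times, i.e. it is optional
+	Matches the preceding character one or more times, i.e. at least once
?	Matches the preceding character zero times or once
{m}	Matches the preceding character exactly m times
{m,}	Matches the preceding character at least m times
{m,n}	Matches the preceding character from m to n times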
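
Each quantifier can be exercised with one accepting or rejecting case. This is a small sketch with arbitrary sample strings; re.fullmatch is used so that leftover characters count as a failure.

```python
import re

# (pattern, text, whether the whole text should match)
cases = [
    (r"ab*", "a", True),        # * allows zero b's
    (r"ab*", "abbb", True),     # ... or many
    (r"ab+", "a", False),       # + needs at least one b
    (r"ab?", "abb", False),     # ? allows at most one b
    (r"a{3}", "aaa", True),     # exactly three
    (r"a{2,}", "a", False),     # at least two
    (r"a{2,3}", "aaaa", False), # between two and three
]

outcomes = [(re.fullmatch(p, t) is not None) == expected for p, t, expected in cases]
print(all(outcomes))  # True
```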

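5.1 Example 1: *

Requirement: match a string whose first character is an uppercase letter, followed by zero or more lowercase letters

import re
ret = re.match("[A-Z][a-z]*", "Mm")
ret.group()

Out:'Mm'

# Zero lowercase letters after the first character is fine
ret = re.match("[A-Z][a-z]*", "M")
ret.group()

Out:'M'

# Several consecutive lowercase letters also match
ret = re.match("[A-Z][a-z]*", "Mabcdef")
ret.group()

Out:'Mabcdef'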

5.2 Example 2: +

Requirement: check whether a variable name is valid

# [a-zA-Z]+ requires at least one letter; [\w]* allows zero or more word characters after it
ret = re.match(r"[a-zA-Z]+[\w]*", "name1")
ret.group()

Out:'name1'

ret = re.match(r"[a-zA-Z_]+[\w]*", "_name")
ret.group()

Out:'_name'

ret = re.match(r"[a-zA-Z_]+[\w]*", "2_name")
ret.group()

Output:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-65-572ee7f143b4> in <module>
      1 ret = re.match(r"[a-zA-Z_]+[\w]*", "2_name")
----> 2 ret.group()

AttributeError: 'NoneType' object has no attribute 'group'
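
re.match only anchors the start of the string, so a name such as name-1 still yields a partial match ('name') with this kind of pattern. A hedged sketch of a stricter check follows; is_valid_name is a hypothetical helper, and in str patterns \w also accepts non-ASCII word characters. For real Python identifiers, the built-in str.isidentifier is the more complete check.

```python
import re

# First character: letter or underscore; then any number of word characters
NAME = re.compile(r"[A-Za-z_]\w*")

def is_valid_name(name):
    """Hypothetical helper: True only if the whole string fits the pattern."""
    return NAME.fullmatch(name) is not None

for candidate in ["name1", "_name", "2_name", "name-1"]:
    print(candidate, is_valid_name(candidate))
```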

5.3 Example 3: ?

Requirement: match a number from 0 to 99

ret = re.match("[1-9]?[0-9]", "7")
ret.group()

Out:'7'

ret = re.match("[1-9]?[0-9]", "33")
ret.group()

Out:'33'

# Why this result: [1-9] cannot match the leading 0, and ? allows zero occurrences, so [0-9] matches the 0
ret = re.match("[1-9]?[0-9]", "09")
ret.group()

Out:'0'

5.4 Example 4: {m}

Requirement: match a password of 8 to 20 characters made of uppercase and lowercase letters, digits, and underscores

import re

# {6} matches exactly six characters
ret = re.match("[a-zA-Z0-9_]{6}", "12a3g45678")
ret.group()

Out:'12a3g4'

ret = re.match("[a-zA-Z0-9_]{8,20}", "1ad12fffs39d739473920_d398sd")
ret.group()

Out:'1ad12fffs39d73947392'

# For ASCII text, [\w] is equivalent to [a-zA-Z0-9_]; the other classes have similar equivalents
ret = re.match(r"[\w]{8,20}", "1ad12fffs39d739473920_d398sd")
ret.group()

Out:'1ad12fffs39d73947392'

A simple exercise: match a 163 email address with 4 to 20 characters before the @, for example hello@163.com

# Simple exercise: match a 163 email address (incomplete)
ret = re.match(r"[\w]{4,20}@163\.com", "hell0@163.com")
ret.group()

Out:'hell0@163.com'

# Think about how to require that the string ends with ...
ret = re.match(r"[\w]{4,20}@163\.com", "hello@163.comddddd")
ret.group()

Out:'hello@163.com'

0x06 Boundaries

Pattern	Meaning
^	Matches the start of the string
$	Matches the end of the string
\b	Matches a word boundary
\B	Matches a position that is not a word boundary

6.1 Example 1: $

Requirement: match a 163.com email address

import re

# A valid address
ret = re.match(r"[\w]{4,20}@163\.com", "xiaoWang@163.com")
ret.group()

Out:'xiaoWang@163.com'

# An invalid address
ret = re.match(r"[\w]{4,20}@163\.com", "xiaoWang@163.comheihhh")
ret.group()

Out:'xiaoWang@163.com'

# Use $ to anchor the end and reject the invalid address
ret = re.match(r"[\w]{4,20}@163\.com$", "xiaoWang@163.comheihhh")
ret.group()

Output:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-93-83efc35bff7d> in <module>
      1 # Use $ to anchor the end and reject the invalid address
      2 ret = re.match(r"[\w]{4,20}@163\.com$", "xiaoWang@163.comheihhh")
----> 3 ret.group()

AttributeError: 'NoneType' object has no attribute 'group'
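
One detail worth knowing: in Python, $ also matches just before a newline at the very end of the string. Where that matters, re.fullmatch is a stricter alternative. The sketch below reuses the article's 163.com pattern; the sample inputs are illustrative.

```python
import re

pattern = r"\w{4,20}@163\.com"

# $ also matches just before a trailing newline ...
with_dollar = re.match(pattern + "$", "xiaoWang@163.com\n") is not None
# ... while fullmatch requires the pattern to cover the entire string
strict_newline = re.fullmatch(pattern, "xiaoWang@163.com\n") is not None
strict_ok = re.fullmatch(pattern, "xiaoWang@163.com") is not None
strict_tail = re.fullmatch(pattern, "xiaoWang@163.comheihhh") is not None

print(with_dollar, strict_newline, strict_ok, strict_tail)  # True False True False
```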

6.2 Example 2: \b

re.match(r".*\bver\b", "ho ver abc").group()

Out:'ho ver'

With no word boundary to the right of ver, \b does not match:

re.match(r".*\bver\b", "ho verabc").group()

Out:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-95-d65bbce81439> in <module>
----> 1 re.match(r".*\bver\b", "ho verabc").group()

AttributeError: 'NoneType' object has no attribute 'group'

With no word boundary to the left of ver, it does not match either:

re.match(r".*\bver\b", "hover abc").group()

Out:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-96-53f69a15373e> in <module>
----> 1 re.match(r".*\bver\b", "hover abc").group()

AttributeError: 'NoneType' object has no attribute 'group'

6.3 Example 3: \B

re.match(r".*\Bver\B", "hoverabc").group()

Out:'hover'

# A boundary on the left of ver: no match
re.match(r".*\Bver\B", "ho verabc").group()

Out:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-98-a9d342249fa1> in <module>
      1 # A boundary on the left of ver: no match
----> 2 re.match(r".*\Bver\B", "ho verabc").group()

AttributeError: 'NoneType' object has no attribute 'group'

With a boundary on the right of ver, \B does not match:

re.match(r".*\Bver\B", "hover abc").group()

Out:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-99-aae06b891466> in <module>
----> 1 re.match(r".*\Bver\B", "hover abc").group()

AttributeError: 'NoneType' object has no attribute 'group'

# Boundaries on both sides of ver: \B does not match
re.match(r".*\Bver\B", "ho ver abc").group()

Out:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-101-9630adaf466f> in <module>
      1 # Boundaries on both sides of ver: \B does not match
----> 2 re.match(r".*\Bver\B", "ho ver abc").group()

AttributeError: 'NoneType' object has no attribute 'group'
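
The difference between \b and \B is easy to see with re.findall, which collects every occurrence instead of stopping at a failed match. This is an illustrative sketch with made-up sample text.

```python
import re

text = "ver hover verabc ho ver"

# \bver\b: ver as a whole word (boundaries on both sides)
whole = re.findall(r"\bver\b", text)
print(whole)  # ['ver', 'ver']

# \Bver\B: ver strictly inside a longer word (no boundary on either side)
inner = re.findall(r"\Bver\B", "hoverabc ver")
print(inner)  # ['ver']
```

In the second call only the ver inside hoverabc is reported; the standalone ver has boundaries on both sides, so \B rejects it.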

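0x07 Grouping

Pattern	Meaning
|	Matches either the expression on its left or the one on its right
(ab)	Treats the characters in parentheses as a group
\num	Refers back to the string matched by group number num
(?P<name>)	Gives a group a name
(?P=name)	Refers back to the string matched by the group named name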

7.1 Example 1: |

Requirement: match a number from 0 to 100

import re
ret = re.match(r"[1-9]?\d", "8")
ret.group()

Out:'8'

re.match(r"[1-9]?\d", "78").group()

Out:'78'

# An undesired result
re.match(r"[1-9]?\d", "08").group()

Out:'0'

# After the fix
re.match(r"[1-9]?\d$", "08").group()

Out:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-106-70c347cddb39> in <module>
      1 # After the fix
----> 2 re.match(r"[1-9]?\d$", "08").group()

AttributeError: 'NoneType' object has no attribute 'group'

Adding |

This matches 0 through 100 (inclusive)

# Add |
re.match(r"[1-9]?\d$|100", "100").group()

Out:'100'

re.match(r"[1-9]?\d$|100", "0").group()

Out:'0'
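
In a pattern such as [1-9]?\d$|100 the $ belongs only to the left alternative, so the right alternative still accepts trailing characters. A minimal sketch of a stricter range check follows; in_range is a hypothetical helper, and re.fullmatch applies the end-of-string requirement to every alternative.

```python
import re

# The $ only anchors the first alternative, so "100abc" still matches "100"
loose = re.match(r"[1-9]?\d$|100", "100abc").group()
print(loose)  # 100

NUMBER = re.compile(r"100|[1-9]?\d")

def in_range(text):
    """Hypothetical helper: True if text is an integer 0..100 without a leading zero."""
    return NUMBER.fullmatch(text) is not None

print([in_range(t) for t in ["0", "8", "78", "100", "08", "101", "100abc"]])
```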

7.2 Example 2: ()

Requirement: match 163, 126, and qq email addresses

re.match(r"\w{4,20}@163\.com", "test@163.com").group()

Out:'test@163.com'

# Use () to group the alternatives
re.match(r"\w{4,20}@(163|126|qq)\.com", "test@126.com").group()

Out:'test@126.com'

re.match(r"\w{4,20}@(163|126|qq)\.com", "test@gmail.com").group()

Out:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-111-d286dbf8fcf2> in <module>
----> 1 re.match(r"\w{4,20}@(163|126|qq)\.com", "test@gmail.com").group()

AttributeError: 'NoneType' object has no attribute 'group'

Exercise:

ret = re.match(r"([^-]*)-(\d+)", "010-12345678")
ret.group()

Out:'010-12345678'

ret.groups()

Out:('010', '12345678')

# ret.group() is the same as ret.group(0)
ret.group(1)

Out:'010'

ret.group(2)

Out:'12345678'

7.3 Example 3: \

Requirement: match <html>hh</html>

# This matches a well-formed string
re.match(r"<[a-zA-Z]*>\w*</[a-zA-Z]*>", "<html>hh</html>").group()

Out:'<html>hh</html>'

# Given malformed HTML, the pattern still matches, which is not what we want
re.match(r"<[a-zA-Z]*>\w*</[a-zA-Z]*>", "<html>hh</htmlbalabala>").group()

Out:'<html>hh</htmlbalabala>'

Idea: whatever appears in the first pair of <> should also appear in the second pair

# Refer back to the text captured by the group; note that the pattern must be a raw string, i.e. written as r""
re.match(r"<([a-zA-Z]*)>\w*</\1>", "<html>hh</html>").group()

Out:'<html>hh</html>'

# The two pairs of <> contain different text, so there is no match
re.match(r"<([a-zA-Z]*)>\w*</\1>", "<html>hh</htmlddd>").group()

Out:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-126-f2f15b62f076> in <module>
      1 # The two pairs of <> contain different text, so there is no match
----> 2 re.match(r"<([a-zA-Z]*)>\w*</\1>", "<html>hh</htmlddd>").group()

AttributeError: 'NoneType' object has no attribute 'group'

7.4 Example 4: \number

Requirement: match <html><h1>www.baidu.com</h1></html>

re.match(r"<(\w*>)<(\w*)>.*</\2></\1", "<html><h1>www.baidu.com</h1></html>").group()

Out:'<html><h1>www.baidu.com</h1></html>'

re.match(r"<(\w*>)<(\w*)>.*</\2></\1", "<html><h1>www.baidu.com</h2></html>").group()

Out:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-128-1cf071943a2d> in <module>
----> 1 re.match(r"<(\w*>)<(\w*)>.*</\2></\1", "<html><h1>www.baidu.com</h2></html>").group()

AttributeError: 'NoneType' object has no attribute 'group'
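
Backreferences can also be combined with a second group to pull out what sits between a matching pair of tags. The following is a rough sketch, not a real HTML parser; inner_text and TAG_PAIR are hypothetical names introduced for illustration.

```python
import re

# Group 1 captures the tag name, \1 requires the same name in the closing tag,
# group 2 captures whatever sits in between.
TAG_PAIR = re.compile(r"<(\w+)>(.*?)</\1>")

def inner_text(html):
    """Hypothetical helper: content of a single matching tag pair, else None."""
    m = TAG_PAIR.fullmatch(html)
    return m.group(2) if m else None

print(inner_text("<html>hh</html>"))             # hh
print(inner_text("<html>hh</htmlddd>"))          # None
print(inner_text("<html><h1>x</h1></html>"))     # <h1>x</h1>
```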

7.5 Example 5: (?P<name>) and (?P=name)

ret = re.match(r"<(?P<name1>\w*)><(?P<name2>\w*)>.*</(?P=name2)></(?P=name1)>", "<html><h1>www.itcast.cn</h1></html>")
ret.group()

Out:'<html><h1>www.itcast.cn</h1></html>'

Note: the letter P in (?P<name>) and (?P=name) is uppercase
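
Named groups can be read back by name as well as by number. A small sketch, reusing the phone-number string from the earlier exercise; the group names area and number are arbitrary choices.

```python
import re

m = re.match(r"(?P<area>[^-]*)-(?P<number>\d+)", "010-12345678")

print(m.group("area"))    # 010
print(m.group("number"))  # 12345678
print(m.groupdict())      # {'area': '010', 'number': '12345678'}
```

Names tend to stay readable when a pattern grows and group numbers shift.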

0x08 Advanced use of the re module

search

Requirement: extract an article's read count

ret = re.search(r"\d+", "Read count: 9999")
ret.group()

Out:'9999'

findall

Requirement: collect the read counts of the python, c, and c++ articles

ret = re.findall(r"\d+", "python=9999,c=7890,c++=1234")
ret

Out:['9999', '7890', '1234']
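
When a pattern contains groups, re.findall returns a tuple per match, which makes it easy to keep each count together with its language. A minimal sketch under that assumption; the variable names are illustrative.

```python
import re

text = "python=9999,c=7890,c++=1234"

# Two groups: the name (word characters or '+') and the number after '='
pairs = re.findall(r"([\w+]+)=(\d+)", text)
print(pairs)   # [('python', '9999'), ('c', '7890'), ('c++', '1234')]

counts = {name: int(value) for name, value in pairs}
print(counts)  # {'python': 9999, 'c': 7890, 'c++': 1234}
```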

sub

Replaces the matched text

Requirement: add 1 to the matched read count

Method 1

# sub(pattern, replacement, string to search)
ret = re.sub(r"\d+", "998", "python=997")
ret

Out:'python=998'

Method 2

# sub also accepts a function; each match is passed to it as a Match object
import re

def add(temp):
    print(temp)
    str_num = temp.group()
    num = int(str_num) + 1
    return str(num)

ret = re.sub(r"\d+", add, "python=997")
ret

Prints:<re.Match object; span=(7, 10), match='997'>

Out:'python=998'
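
The replacement function is called once per match, so the same idea handles several numbers in one pass. A short sketch with an illustrative input string; count is a standard re.sub parameter that limits how many matches are replaced.

```python
import re

text = "python=997,c=99,c++=1234"

# Increment every number found in the string
all_bumped = re.sub(r"\d+", lambda m: str(int(m.group()) + 1), text)
print(all_bumped)    # python=998,c=100,c++=1235

# Only replace the first match
first_bumped = re.sub(r"\d+", lambda m: str(int(m.group()) + 1), text, count=1)
print(first_bumped)  # python=998,c=99,c++=1234
```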

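Exercise: extract the text from the string below

<div>
<p>Responsibilities:</p>
<p>Server-side work on recommendation algorithms, data statistics, APIs, and back-end services</p>
<p><br></p>
<p>Requirements:</p>
<p>Strong self-motivation and professionalism; proactive and results-oriented</p>
<p>&nbsp;<br></p>
<p>Technical requirements:</p>
<p>1. At least one year of Python development experience; solid grasp of object-oriented analysis and design; familiar with design patterns</p>
<p>2. Solid grasp of HTTP; familiar with MVC, MVVM, and related web frameworks</p>
<p>3. Solid grasp of relational database design and SQL; proficient in MySQL or PostgreSQL<br></p>
<p>4. Solid grasp of NoSQL and MQ; proficient with the corresponding solutions</p>

s = """<div>
<p>Responsibilities:</p>
<p>Server-side work on recommendation algorithms, data statistics, APIs, and back-end services</p>
<p><br></p>
<p>Requirements:</p>
<p>Strong self-motivation and professionalism; proactive and results-oriented</p>
<p>&nbsp;<br></p>
<p>Technical requirements:</p>
<p>1. At least one year of Python development experience; solid grasp of object-oriented analysis and design; familiar with design patterns</p>
"""
ret = re.sub(r"</?\w*>", "", s)
ret

Out:
'\nResponsibilities:\nServer-side work on recommendation algorithms, data statistics, APIs, and back-end services\n\nRequirements:\nStrong self-motivation and professionalism; proactive and results-oriented\n&nbsp;\nTechnical requirements:\n1. At least one year of Python development experience; solid grasp of object-oriented analysis and design; familiar with design patterns\n'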

split

Splits a string wherever the pattern matches and returns a list

Requirement: split the string "info:dddd 33 shandong"

ret = re.split(r":| ", "info:dddd 33 shandong")
ret

Out:['info', 'dddd', '33', 'shandong']
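
Two variations on re.split are sketched below: a character class in place of the alternation, and the standard maxsplit parameter to stop after the first cut. The second input string, with repeated separators, is an illustrative assumption.

```python
import re

# Stop after the first separator
head_tail = re.split(r"[: ]", "info:dddd 33 shandong", maxsplit=1)
print(head_tail)  # ['info', 'dddd 33 shandong']

# Treat any run of colons or whitespace as a single separator
fields = re.split(r"[:\s]+", "info:  dddd\t33 shandong")
print(fields)     # ['info', 'dddd', '33', 'shandong']
```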

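0x09 Greedy and non-greedy matching in Python

In Python, quantifiers are greedy by default (a few languages may default to non-greedy): they always try to match as many characters as possible.

Non-greedy matching is the opposite: it always tries to match as few characters as possible.

Adding ? after *, ?, +, or {m,n} turns a greedy quantifier into a non-greedy one.

s = "this is a number 234-235-22-423"
r = re.match(r".+(\d+-\d+-\d+-\d+)", s)
r.group(1)

Out:'4-235-22-423'

# Make it non-greedy
r = re.match(r".+?(\d+-\d+-\d+-\d+)", s)
r.group(1)

Out:'234-235-22-423'

Explanation:
When a pattern contains wildcards, the engine evaluates it from left to right and tries to grab the longest string that still lets the match succeed. In the example above, ".+" grabs as many characters as it can from the start of the string, including most of the first integer field we wanted. "\d+" needs only one digit to match, so it matches the digit "4", while ".+" matches everything from the start of the string up to that 4.
Solution: the non-greedy operator "?". It can follow *, ?, +, or {m,n} and asks the engine to match as little as possible.

re.match(r"aa(\d+)", "aa2343ddd").group(1)

Out:'2343'

re.match(r"aa(\d+?)", "aa2343ddd").group(1)

Out:'2'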
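
The same contrast shows up clearly with tag-like text, where a greedy quantifier swallows everything up to the last closing bracket. A small sketch with made-up input:

```python
import re

text = "<a><b>"

greedy = re.findall(r"<.+>", text)   # .+ runs to the last '>'
lazy = re.findall(r"<.+?>", text)    # .+? stops at the first '>'
print(greedy)  # ['<a><b>']
print(lazy)    # ['<a>', '<b>']

# The same applies to {m,n}: the lazy form takes the minimum it can
shortest = re.match(r"\d{2,4}?", "2343").group()
print(shortest)  # 23
```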

Exercise 1

Extract the image link from the text below

<link rel="apple-touch-icon-precomposed" href="https://s.mozhe.cn/static/ico/apple-touch-icon.png">

s = """<link rel="apple-touch-icon-precomposed" href="https://s.mozhe.cn/static/ico/apple-touch-icon.png"> <link rel="apple-touch-icon-precomposed" href="https://s.mozhe.cn/static/ico/apple-touch-icon.png">"""
ret = re.search(r"https:.+?\.png", s)
ret.group()

Out:'https://s.mozhe.cn/static/ico/apple-touch-icon.png'

Exercise 2

https://www.baidu.com/s?wd=dd&rsv_spt=1
After applying the regular expression it becomes:
https://www.baidu.com/

# First attempt: this removes the part we wanted to keep
s = """https://www.baidu.com/s?wd=dd&rsv_spt=1"""
re.sub(r"https://.+?/", "", s)

Out:'s?wd=dd&rsv_spt=1'

# Use an anonymous function (lambda) as the replacement function
re.sub(r"(https://.+?/).*", lambda x: x.group(1), s)

Out:'https://www.baidu.com/'
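
For comparison, two other ways to reach the same result are sketched here: matching the prefix directly instead of substituting, and using the standard library's urllib.parse.urlsplit, which is a URL parser rather than a regular expression. Both are offered as alternatives, not as the article's intended solution.

```python
import re
from urllib.parse import urlsplit

url = "https://www.baidu.com/s?wd=dd&rsv_spt=1"

# Match the scheme and host directly: everything up to the first '/' after the host
by_regex = re.match(r"https://[^/]+/", url).group()
print(by_regex)   # https://www.baidu.com/

# Let the URL parser split the parts, then rebuild the prefix
parts = urlsplit(url)
by_parser = f"{parts.scheme}://{parts.netloc}/"
print(by_parser)  # https://www.baidu.com/
```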
