Week 7: Regular Expressions

Regular Expressions
#

A regular expression, also known as a regex, is a pattern that we can use to match strings. It is a powerful tool for searching text and for checking that input has the shape we expect.

Validation with Regular Expressions
#

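# Baseline check using only string methods (no regex yet)
email = input("What's your email? ").strip()

# Raises ValueError unless the input contains exactly one "@"
username, domain = email.split("@")

if username and domain.endswith(".edu"):
    print("Valid")
else:
    print("Invalid")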
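
The check above stops with a ValueError whenever the input does not contain exactly one "@", because unpacking into two names fails. As a minimal sketch of a safer non-regex baseline, str.partition can be used instead. The helper name is_edu_email is an assumption made for illustration and is not part of the original notes.

```python
def is_edu_email(email: str) -> bool:
    """Hypothetical helper: True for inputs shaped like 'user@domain.edu'."""
    # partition always returns three parts, so unpacking cannot fail
    username, at, domain = email.strip().partition("@")
    # Require an "@", a non-empty username, no second "@", and a .edu ending
    return bool(at and username and "@" not in domain and domain.endswith(".edu"))


print(is_edu_email("malan@harvard.edu"))
print(is_edu_email("malan"))
```

Even this version still accepts odd inputs such as "a@.edu", which is part of the motivation for moving to patterns.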

re Library
#

The re library lets us define a pattern and test strings against it.

Documentation

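  • re.search(pattern, string, flags=0) Searches the string for the first match of the pattern and returns a match object, or None if there is no match.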
import re

email = input("What's your email? ").strip()

if re.search("@", email):
    print("Valid")
else:
    print("Invalid")
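
The if statement works because re.search returns a match object (truthy) when the pattern is found and None (falsy) when it is not. A small sketch of what comes back, using made-up sample strings:

```python
import re

# A match object on success, None otherwise
found = re.search("@", "malan@harvard.edu")
missing = re.search("@", "malan")

print(found.group(), found.start())  # the matched text and its index
print(missing)
```

Note that the pattern "@" only checks that an "@" appears somewhere, so the single character "@" would also be reported as valid.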

Regular Expressions Patterns
#

Symbols
#

Symbol     Description
.          Matches any character except a newline.
^          Matches the start of the string.
$          Matches the end of the string, or just before a trailing newline.
*          Matches 0 or more repetitions of the preceding pattern.
+          Matches 1 or more repetitions of the preceding pattern.
?          Matches 0 or 1 repetition of the preceding pattern.
{n}        Matches exactly n repetitions of the preceding pattern.
{n,}       Matches n or more repetitions of the preceding pattern.
{n,m}      Matches between n and m repetitions of the preceding pattern.
[]         Matches any one of the characters inside the brackets.
[^]        Matches any character not inside the brackets.
A|B        Matches either the pattern before or the pattern after the pipe.
()         Groups patterns and captures the matched text.
(?:...)    Groups patterns without capturing the matched text.
\          Escapes special characters or denotes a special sequence.
\d         Matches any digit; equivalent to [0-9] for ASCII text.
\D         Matches any non-digit; equivalent to [^0-9] for ASCII text.
\w         Matches any word character (alphanumeric plus underscore); equivalent to [a-zA-Z0-9_] for ASCII text.
\W         Matches any non-word character; equivalent to [^a-zA-Z0-9_] for ASCII text.
\s         Matches any whitespace character (spaces, tabs, newlines).
\S         Matches any non-whitespace character.
\b         Matches a word boundary.
\B         Matches a non-word boundary.
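
To get a feel for the quantifiers and sets in the table, here is a small sketch that checks a few made-up pattern/text pairs. It uses re.fullmatch (covered later in these notes) so that each pattern has to account for the whole text.

```python
import re

# (pattern, text, expected result of matching the WHOLE text)
cases = [
    (r"a*", "", True),             # * accepts zero repetitions
    (r"a+", "", False),            # + needs at least one
    (r"colou?r", "color", True),   # ? makes the "u" optional
    (r"\d{3}", "123", True),       # exactly three digits
    (r"\d{2,4}", "12345", False),  # between two and four digits
    (r"[^@]+", "a@b", False),      # no "@" allowed anywhere
    (r"cat|dog", "dog", True),     # either alternative
]

results = [bool(re.fullmatch(pattern, text)) for pattern, text, _ in cases]
for (pattern, text, expected), result in zip(cases, results):
    print(f"{pattern!r:12} {text!r:10} {result}")
```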

Code using Regular Expression Patterns
#

import re

email = input("What's your email? ").strip()

if re.search(r".+@.+\.edu", email):
    print("Valid")
else:
    print("Invalid")
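
The backslash in \.edu matters: an unescaped dot means "any character", so it would accept input that has no dot at all. The r prefix makes the string raw, so Python passes the backslash through to re unchanged. A quick sketch with a made-up input:

```python
import re

text = "malan@harvard?edu"  # not an address: "?" where "." should be

# Unescaped "." matches any character, so "?edu" is accepted
loose = bool(re.search(r".+@.+.edu", text))
# Escaped "\." only matches a literal dot
strict = bool(re.search(r".+@.+\.edu", text))

print(loose, strict)
```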

Matching Start and End
#

  • ^ Matches the start of the string.
  • $ Matches the end of the string.
import re

email = input("What's your email? ").strip()

if re.search(r"^.+@.+\.edu$", email):
    print("Valid")
else:
    print("Invalid")
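
Without ^ and $ the pattern only has to match somewhere inside the input, so extra text around the address goes unnoticed. A minimal sketch comparing the two patterns on a made-up sentence:

```python
import re

sentence = "My email address is malan@harvard.edu and that is it"

unanchored = re.search(r".+@.+\.edu", sentence)
anchored = re.search(r"^.+@.+\.edu$", sentence)

print(unanchored.group())  # the part of the sentence that matched
print(anchored)            # no match: the input does not end in .edu
```

The anchored pattern still allows spaces before the "@", because . matches any character; the sets in the next section tighten that.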

Sets of Characters
#

  • [] Matches any one of the characters inside the brackets.
  • [^] Matches any character not inside the brackets.
1. [^@]+
#
import re

email = input("What's your email? ").strip()

if re.search(r"^[^@]+@[^@]+\.edu$", email):
    print("Valid")
else:
    print("Invalid")
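
Replacing .+ with [^@]+ means the username and the domain can no longer contain an "@" themselves. A short sketch comparing both patterns on made-up inputs:

```python
import re

dot_pattern = r"^.+@.+\.edu$"
set_pattern = r"^[^@]+@[^@]+\.edu$"

candidates = ["malan@harvard.edu", "malan@@@harvard.edu", "@harvard.edu"]
dot_results = [bool(re.search(dot_pattern, c)) for c in candidates]
set_results = [bool(re.search(set_pattern, c)) for c in candidates]

for candidate, dot_ok, set_ok in zip(candidates, dot_results, set_results):
    print(candidate, dot_ok, set_ok)
```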
2. [a-zA-Z0-9_]+
#
import re

email = input("What's your email? ").strip()

if re.search(r"^[a-zA-Z0-9_]+@[a-zA-Z0-9_]+\.edu$", email):
    print("Valid")
else:
    print("Invalid")
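
Listing the allowed characters is stricter than excluding "@", and it also rejects some addresses we might want to accept. A sketch with made-up inputs that shows the trade-off:

```python
import re

pattern = r"^[a-zA-Z0-9_]+@[a-zA-Z0-9_]+\.edu$"

candidates = [
    "malan@harvard.edu",        # plain address
    "ma lan@harvard.edu",       # space is not in the set
    "david.malan@harvard.edu",  # dot in the username is not in the set
    "malan@cs50.harvard.edu",   # subdomain needs a second dot
]
results = [bool(re.search(pattern, c)) for c in candidates]

for candidate, ok in zip(candidates, results):
    print(candidate, ok)
```

The last two cases are handled in the Groups section further down.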

Character Classes
#

  • \d Matches any digit; equivalent to [0-9] for ASCII text.
  • \D Matches any non-digit; equivalent to [^0-9] for ASCII text.
  • \s Matches any whitespace character (spaces, tabs, newlines).
  • \S Matches any non-whitespace character.
  • \w Matches any word character (alphanumeric plus underscore); equivalent to [a-zA-Z0-9_] for ASCII text.
  • \W Matches any non-word character; equivalent to [^a-zA-Z0-9_] for ASCII text.
import re

email = input("What's your email? ").strip()

if re.search(r"^\w+@\w+\.edu$", email):
    print("Valid")
else:
    print("Invalid")
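
One detail worth knowing: for str patterns in Python 3, \w and \d are Unicode-aware by default, so they match more than the ASCII ranges unless the re.ASCII flag is passed. A small sketch with made-up strings:

```python
import re

name = "jos\u00e9"  # "jose" with an accented e

unicode_word = bool(re.fullmatch(r"\w+", name))          # Unicode-aware default
ascii_word = bool(re.fullmatch(r"\w+", name, re.ASCII))  # ASCII letters only
explicit_set = bool(re.fullmatch(r"[a-zA-Z0-9_]+", name))

digits = re.findall(r"\d+", "room 101, floor 7")
words = re.findall(r"\S+", "tabs\tand  spaces")

print(unicode_word, ascii_word, explicit_set)
print(digits, words)
```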

Flags
#

  • re.IGNORECASE Makes the pattern case-insensitive.
  • re.MULTILINE Makes ^ and $ match at the start and end of each line, not only at the start and end of the whole string.
  • re.DOTALL Makes the . character match any character, including newlines.
import re

email = input("What's your email? ").strip()

if re.search(r"^\w+@\w+\.edu$", email, re.IGNORECASE):
    print("Valid")
else:
    print("Invalid")
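
A minimal sketch of what each of the three flags changes, using made-up strings. Flags can also be combined with |, for example re.IGNORECASE | re.MULTILINE.

```python
import re

# IGNORECASE: letters match regardless of case
strict_case = re.search(r"\.edu$", "MALAN@HARVARD.EDU")
any_case = re.search(r"\.edu$", "MALAN@HARVARD.EDU", re.IGNORECASE)

text = "first line\nsecond line"

# MULTILINE: ^ also matches right after each newline
one_start = re.findall(r"^\w+", text)
all_starts = re.findall(r"^\w+", text, re.MULTILINE)

# DOTALL: . also matches the newline character
same_line = re.search(r"first.+second", text)
across_lines = re.search(r"first.+second", text, re.DOTALL)

print(bool(strict_case), bool(any_case))
print(one_start, all_starts)
print(bool(same_line), bool(across_lines))
```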

Groups
#

  • A|B Matches either the pattern before or the pattern after the pipe.
  • () Groups patterns and captures the matched text.
  • (?:...) Groups patterns without capturing the matched text.
import re

email = input("What's your email? ").strip()

if re.search(r"^(\w|\.)+@(\w+\.)?\w+\.edu$", email, re.IGNORECASE):
    print("Valid")
else:
    print("Invalid")
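
Here (\w|\.)+ allows dots in the username and (\w+\.)? allows one optional subdomain. As a sketch with made-up inputs, the same pattern can be written with non-capturing groups when the matched pieces are not needed afterwards:

```python
import re

capturing = r"^(\w|\.)+@(\w+\.)?\w+\.edu$"
non_capturing = r"^(?:\w|\.)+@(?:\w+\.)?\w+\.edu$"

candidates = [
    "malan@harvard.edu",
    "david.malan@cs50.harvard.edu",
    "malan@a.b.harvard.edu",  # two subdomains: one more than the pattern allows
]
results = [bool(re.search(capturing, c, re.IGNORECASE)) for c in candidates]

# Both patterns accept the same input; only the captured groups differ
with_groups = re.search(capturing, "malan@cs50.harvard.edu")
without_groups = re.search(non_capturing, "malan@cs50.harvard.edu")

print(results)
print(with_groups.group(2), without_groups.groups())
```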

Email Address Validation
#

This is the regular expression pattern that browsers use to validate email addresses in <input type="email"> fields; it comes from the HTML specification.

^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$
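
The same pattern can be tried from Python. This is only a sketch: the constant name HTML_EMAIL and the sample addresses are assumptions for illustration, and the pattern is copied from above and split across two raw strings for readability.

```python
import re

# Pattern copied from the notes above; the name HTML_EMAIL is hypothetical
HTML_EMAIL = re.compile(
    r"^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"
)

candidates = [
    "malan@harvard.edu",
    "first.last+tag@example.co.uk",
    "malan@localhost",      # accepted: the dotted parts are optional
    "malan@@harvard.edu",   # rejected: second "@"
    "malan@harvard..edu",   # rejected: empty label between the dots
    "malan@-harvard.edu",   # rejected: label starts with a hyphen
]
results = [bool(HTML_EMAIL.search(c)) for c in candidates]

for candidate, ok in zip(candidates, results):
    print(candidate, ok)
```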

match, fullmatch
#

  • re.match(pattern, string, flags=0) Matches the pattern at the start of the string.
  • re.fullmatch(pattern, string, flags=0) Matches the pattern against the whole string.
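
A small sketch of how the three functions differ on the same unanchored pattern, with made-up strings. With re.match a leading ^ becomes unnecessary, and with re.fullmatch both ^ and $ do.

```python
import re

pattern = r"\w+@\w+\.edu"

# search: the pattern may match anywhere in the string
searched = re.search(pattern, "email: malan@harvard.edu")
# match: the pattern must match starting at index 0
matched_late = re.match(pattern, "email: malan@harvard.edu")
matched_early = re.match(pattern, "malan@harvard.edu is my address")
# fullmatch: the pattern must account for the entire string
full_extra = re.fullmatch(pattern, "malan@harvard.edu is my address")
full_exact = re.fullmatch(pattern, "malan@harvard.edu")

print(bool(searched), bool(matched_late), bool(matched_early))
print(bool(full_extra), bool(full_exact))
```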

Capturing Groups
#

  • When we use () we are creating a capturing group that allows us to extract the matched text.
  • := is the walrus operator, which assigns a value to a variable and lets us use that value in the same expression.
import re

name = input("What's your name? ").strip()
if matches := re.search(r"^(.+), *(.+)$", name):
    name = matches.group(2) + " " + matches.group(1)

print(f"Hello, {name}")
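
The same idea wrapped in a function, so that it can be tried without typing input. This is a sketch; the helper name normalize_name is an assumption for illustration. matches.groups() returns all captured groups at once as a tuple.

```python
import re


def normalize_name(name: str) -> str:
    """Hypothetical helper: turn 'Last, First' into 'First Last'."""
    name = name.strip()
    if matches := re.search(r"^(.+), *(.+)$", name):
        last, first = matches.groups()  # group(1) and group(2) together
        return f"{first} {last}"
    return name  # no comma: leave the name as it is


print(normalize_name("Malan, David"))
print(normalize_name("David Malan"))
```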

Extracting from Strings
#

1. re.sub
#
  • re.sub(pattern, repl, string, count=0, flags=0) Replaces each match of the pattern in the string with the replacement text and returns the result.
import re

url = input("URL: ").strip()

username = re.sub(r"^(https?://)?(www\.)?twitter\.com/", "", url)
print(f"Username: {username}")
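
One limitation of this approach: when the pattern does not match, re.sub returns the string unchanged, so a URL from another site is reported as the "username". A sketch that makes this visible (the helper name strip_twitter_prefix and the sample URLs are assumptions for illustration):

```python
import re


def strip_twitter_prefix(url: str) -> str:
    """Hypothetical helper: remove the twitter.com prefix if it is present."""
    return re.sub(r"^(https?://)?(www\.)?twitter\.com/", "", url.strip())


print(strip_twitter_prefix("https://twitter.com/davidjmalan"))
print(strip_twitter_prefix("www.twitter.com/davidjmalan"))
print(strip_twitter_prefix("https://example.com/davidjmalan"))  # unchanged
```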
2. re.search
#
  • re.search(pattern, string, flags=0) Searches for the pattern in the string.
import re

url = input("URL: ").strip()

# The protocol and the "www." prefix are each optional; the username is captured in group 1
if matches := re.search(r"^(?:https?://)?(?:www\.)?twitter\.com/([a-z0-9_]+)", url, re.IGNORECASE):
    print(f"Username: {matches.group(1)}")
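
With re.search and a capturing group we can tell a match from a non-match, which re.sub cannot do. A sketch wrapped in a function, assuming the protocol and the www. prefix are both optional (the helper name twitter_username and the sample URLs are assumptions for illustration):

```python
import re
from typing import Optional


def twitter_username(url: str) -> Optional[str]:
    """Hypothetical helper: return the username, or None for other URLs."""
    pattern = r"^(?:https?://)?(?:www\.)?twitter\.com/([a-z0-9_]+)"
    if matches := re.search(pattern, url.strip(), re.IGNORECASE):
        return matches.group(1)
    return None


print(twitter_username("https://twitter.com/davidjmalan"))
print(twitter_username("https://www.twitter.com/DavidJMalan?lang=en"))
print(twitter_username("https://example.com/davidjmalan"))
```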

Conclusion
#

There are other useful functions in the re library:

  • re.split(pattern, string, maxsplit=0, flags=0) Splits the string at the matches of the pattern.
  • re.findall(pattern, string, flags=0) Finds all the matches of the pattern in the string.
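
A quick sketch of both functions on made-up strings. When the pattern contains one capturing group, re.findall returns the text of that group instead of the whole match.

```python
import re

colors = "red, green;blue,  yellow"

# Split on a comma or semicolon followed by optional whitespace
parts = re.split(r"[,;]\s*", colors)
first_split = re.split(r"[,;]\s*", colors, maxsplit=1)

message = "Write to malan@harvard.edu or carter@yale.edu"

# Every non-overlapping match, as a list of strings
addresses = re.findall(r"\w+@\w+\.edu", message)
# With one capturing group, only the group is returned
usernames = re.findall(r"(\w+)@\w+\.edu", message)

print(parts, first_split)
print(addresses, usernames)
```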
Gael Mora
Author
IT Security student, Python and Go developer. Specialized in Linux systems administration and automation. Passionate about cloud and network infrastructure, software development and open source technologies.