3.9. String Methods

3.9.1. Rationale

  • str is immutable

  • str methods create a new modified str
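A minimal sketch of what the two points above mean in practice (the variable names are illustrative only): calling a method leaves the original string untouched, so the returned value has to be assigned to a name in order to keep it.

```python
name = 'Angus MacGyver'

# Calling a method returns a new str; the original object is not modified.
shouted = name.upper()

print(name)     # Angus MacGyver
print(shouted)  # ANGUS MACGYVER

# To "change" a variable, rebind the name to the returned value.
name = name.upper()
print(name)     # ANGUS MACGYVER
```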

3.9.2. Strip Whitespace

Strip is a very common method, which you should call on any text coming from user input: from the input() function, but also from files, socket communication and internet data transfer. You never know whether the user pasted the text from another source, which can add whitespace at the beginning or at the end of a string.

There are three strip methods: left strip, right strip and strip from both ends. The word whitespace refers to:

  • \n - newline

  • \t - tab

  • ' ' - space

  • \v - vertical tab

  • \f - form-feed

Most common is plain strip, which will remove all whitespace characters from both sides at the same time:

>>> name = '\tAngus MacGyver    \n'
>>> name.strip()
'Angus MacGyver'

Right strip:

>>> name = '\tAngus MacGyver    \n'
>>> name.rstrip()
'\tAngus MacGyver'

Left strip:

>>> name = '\tAngus MacGyver    \n'
>>> name.lstrip()
'Angus MacGyver    \n'
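As a rough illustration of why stripping matters for user input, the sketch below compares a hypothetical answer (the value of `answer` is made up for this example) with an expected word, before and after stripping.

```python
# Hypothetical value, as it might arrive from input() or a pasted form field.
answer = '  yes \n'

raw_match = answer == 'yes'
clean_match = answer.strip() == 'yes'

print(raw_match)    # False, the surrounding whitespace differs
print(clean_match)  # True

# Whitespace inside the text is left alone; only both ends are stripped.
inner = '  Angus   MacGyver  '.strip()
print(inner)        # Angus   MacGyver
```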

3.9.3. Change Case

Comparing strings that are not normalized will yield invalid or at least unexpected results:

>>> 'MacGyver' == 'Macgyver'
False

Normalize strings before comparing:

>>> 'MacGyver'.upper() == 'Macgyver'.upper()
True

This is necessary for further data analysis: without it, the same value written with different letter case would be treated as two different values.
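A small sketch of such a normalization, assuming two hypothetical user-entered values that should count as the same name. Both sides are cleaned in the same way before they are compared.

```python
# Hypothetical user-entered values that should count as the same name.
first = 'MacGyver'
second = ' macgyver\n'

# Normalize both sides the same way: strip whitespace, then unify the case.
first_clean = first.strip().lower()
second_clean = second.strip().lower()

same = first_clean == second_clean
print(same)  # True
```

Whether to normalize to lower or upper case is a matter of convention; what matters is that both sides go through the same steps.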

Upper:

>>> name = 'Angus MacGyver III'
>>> name.upper()
'ANGUS MACGYVER III'

Lower:

>>> name = 'Angus MacGyver III'
>>> name.lower()
'angus macgyver iii'

Title:

>>> name = 'Angus MacGyver III'
>>> name.title()
'Angus Macgyver Iii'

Capitalize:

>>> name = 'Angus MacGyver III'
>>> name.capitalize()
'Angus macgyver iii'

3.9.4. Replace

Replace substring:

>>> name = 'Angus MacGyver Iii'
>>> name.replace('Iii', 'III')
'Angus MacGyver III'

Replace is case sensitive:

>>> name = 'Angus MacGyver Iii'
>>> name.replace('iii', 'III')
'Angus MacGyver Iii'
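Two further details of str.replace() may be worth a short sketch: by default it replaces every occurrence, and an optional third argument limits how many are replaced. The sample sentence is reused from other examples in this chapter.

```python
text = 'We choose to go to the Moon'

# All occurrences are replaced by default.
every = text.replace('to', 'TO')
print(every)  # We choose TO go TO the Moon

# The optional third argument limits the number of replacements.
first_only = text.replace('to', 'TO', 1)
print(first_only)  # We choose TO go to the Moon

# The original string is unchanged, because str is immutable.
print(text)  # We choose to go to the Moon
```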

3.9.5. Starts With

The .startswith() method answers the question whether a string starts with a given substring:

>>> email = 'mark.watney@nasa.gov'
>>>
>>>
>>> email.startswith('mark.watney')
True
>>>
>>> email.startswith('melissa.lewis')
False

It also accepts a tuple of strings and returns True if the string starts with any of them:

>>> email = 'mark.watney@nasa.gov'
>>> vip = ('mark.watney', 'melissa.lewis')
>>>
>>> email.startswith(vip)
True

3.9.6. Ends With

>>> email = 'mark.watney@nasa.gov'
>>>
>>>
>>> email.endswith('nasa.gov')
True
>>>
>>> email.endswith('esa.int')
False

It also accepts a tuple of suffixes to try:

>>> email = 'mark.watney@nasa.gov'
>>> whitelist = ('nasa.gov', 'esa.int')
>>>
>>> email.endswith(whitelist)
True
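One possible use of the tuple form is filtering a list of addresses by domain. The sketch below is only an illustration: the addresses and the name `allowed` are made up for this example.

```python
# Hypothetical list of addresses and a tuple of accepted domains.
emails = ['mark.watney@nasa.gov', 'alex.vogel@esa.int', 'ivan@roscosmos.ru']
allowed = ('@nasa.gov', '@esa.int')

accepted = []
for email in emails:
    # endswith() returns True if any suffix in the tuple matches.
    if email.endswith(allowed):
        accepted.append(email)

print(accepted)
# ['mark.watney@nasa.gov', 'alex.vogel@esa.int']
```

Including the `@` in each suffix is a deliberate choice here: a bare `'nasa.gov'` would also match an address such as `someone@fakenasa.gov`.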

3.9.7. Split by Line

>>> text = 'Hello\nPython\nWorld'
>>>
>>> text.splitlines()
['Hello', 'Python', 'World']

>>> text = """We choose to go to the Moon!
... We choose to go to the Moon in this decade and do the other things,
... not because they are easy, but because they are hard;
... because that goal will serve to organize and measure the best of our
... energies and skills, because that challenge is one that we are willing
... to accept, one we are unwilling to postpone, and one we intend to win,
... and the others, too."""
>>>
>>>
>>> text.splitlines()  
['We choose to go to the Moon!',
 'We choose to go to the Moon in this decade and do the other things,',
 'not because they are easy, but because they are hard;',
 'because that goal will serve to organize and measure the best of our',
 'energies and skills, because that challenge is one that we are willing',
 'to accept, one we are unwilling to postpone, and one we intend to win,',
 'and the others, too.']

3.9.8. Split by Character

  • With an argument - split on every occurrence of that separator

  • No argument - split on any run of whitespace

>>> text = '1,2,3,4'
>>> text.split(',')
['1', '2', '3', '4']
>>> setosa = '5.1,3.5,1.4,0.2,setosa'
>>> setosa.split(',')
['5.1', '3.5', '1.4', '0.2', 'setosa']
>>> text = 'We choose to go to the Moon'
>>> text.split(' ')
['We', 'choose', 'to', 'go', 'to', 'the', 'Moon']
>>> text = 'We choose to go to the Moon'
>>> text.split()
['We', 'choose', 'to', 'go', 'to', 'the', 'Moon']
>>> text = '10.13.37.1      nasa.gov esa.int roscosmos.ru'
>>> text.split(' ')
['10.13.37.1', '', '', '', '', '', 'nasa.gov', 'esa.int', 'roscosmos.ru']
>>> text = '10.13.37.1      nasa.gov esa.int roscosmos.ru'
>>> text.split()
['10.13.37.1', 'nasa.gov', 'esa.int', 'roscosmos.ru']
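The last two examples can be turned into a small parsing sketch. It assumes the line follows a hosts-file layout (an address followed by host names); the names `ip` and `hosts` are chosen for this illustration only.

```python
# One line in the style of a hosts file: an address followed by host names.
line = '10.13.37.1      nasa.gov esa.int roscosmos.ru'

# split() without arguments collapses runs of whitespace.
fields = line.split()
ip = fields[0]
hosts = fields[1:]

print(ip)     # 10.13.37.1
print(hosts)  # ['nasa.gov', 'esa.int', 'roscosmos.ru']

# The optional maxsplit argument limits the number of splits.
head_and_rest = line.split(maxsplit=1)
print(head_and_rest)
# ['10.13.37.1', 'nasa.gov esa.int roscosmos.ru']
```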

3.9.9. Join by Character

>>> letters = ['a', 'b', 'c']
>>> ''.join(letters)
'abc'
>>> words = ['We', 'choose', 'to', 'go', 'to', 'the', 'Moon']
>>> ' '.join(words)
'We choose to go to the Moon'
>>> setosa = ['5.1', '3.5', '1.4', '0.2', 'setosa']
>>> ','.join(setosa)
'5.1,3.5,1.4,0.2,setosa'
>>> crew = ['First line', 'Second line', 'Third line']
>>> '\n'.join(crew)
'First line\nSecond line\nThird line'

3.9.10. Join Numbers

The str.join() method expects all elements of its argument to be strings. Therefore it raises an error if a sequence of numbers is passed:

>>> data = [1, 2, 3]
>>> ','.join(data)
Traceback (most recent call last):
TypeError: sequence item 0: expected str instance, int found

In order to avoid the error, you have to convert all the values to strings before passing them to str.join(). The following example uses the generator expression syntax, which applies str() to every element in data. More information in Generator Expression:

>>> data = [1, 2, 3]
>>> ','.join(str(x) for x in data)
'1,2,3'

You can also use the map() function, which applies str() to every element in data. More information in Generator Mapping:

>>> data = [1, 2, 3]
>>> ','.join(map(str, data))
'1,2,3'
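The same conversion is needed when a sequence mixes numbers and text. The sketch below uses a hypothetical row of measurements and also shows the way back: str.split() always yields strings, so the numbers have to be converted again.

```python
# Hypothetical row mixing numbers and text.
row = [5.1, 3.5, 1.4, 0.2, 'setosa']

# Convert every element to str first, then join.
line = ','.join(str(value) for value in row)
print(line)  # 5.1,3.5,1.4,0.2,setosa

# Going back: split() returns strings, so numbers need converting again.
fields = line.split(',')
values = [float(x) for x in fields[:-1]]
species = fields[-1]

print(values)   # [5.1, 3.5, 1.4, 0.2]
print(species)  # setosa
```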

3.9.11. Is Whitespace

>>> text = ''
>>> text.isspace()
False
>>> text = ' '
>>> text.isspace()
True
>>> text = '\t'
>>> text.isspace()
True
>>> text = '\n'
>>> text.isspace()
True

Figure 3.2. ISS - International Space Station. Credits: NASA/Crew of STS-132 (img: s132e012208).

3.9.12. Is Alphabet Characters

>>> text = 'hello'
>>> text.isalpha()
True
>>> text = 'hello1'
>>> text.isalpha()
False

3.9.13. Is Numeric

Is decimal:

>>> '1'.isdecimal()
True
>>>
>>> '+1'.isdecimal()
False
>>>
>>> '-1'.isdecimal()
False
>>>
>>> '1.'.isdecimal()
False
>>>
>>> '1,'.isdecimal()
False
>>>
>>> '1.0'.isdecimal()
False
>>>
>>> '1,0'.isdecimal()
False
>>>
>>> '1_0'.isdecimal()
False
>>>
>>> '10'.isdecimal()
True

Is digit:

>>> '1'.isdigit()
True
>>>
>>> '+1'.isdigit()
False
>>>
>>> '-1'.isdigit()
False
>>>
>>> '1.'.isdigit()
False
>>>
>>> '1,'.isdigit()
False
>>>
>>> '1.0'.isdigit()
False
>>>
>>> '1,0'.isdigit()
False
>>>
>>> '1_0'.isdigit()
False
>>>
>>> '10'.isdigit()
True

Is numeric:

>>> '1'.isnumeric()
True
>>>
>>> '+1'.isnumeric()
False
>>>
>>> '-1'.isnumeric()
False
>>>
>>> '1.'.isnumeric()
False
>>>
>>> '1.0'.isnumeric()
False
>>>
>>> '1,0'.isnumeric()
False
>>>
>>> '1_0'.isnumeric()
False
>>>
>>> '10'.isnumeric()
True

Is alphanumeric:

>>> '1'.isalnum()
True
>>>
>>> '+1'.isalnum()
False
>>>
>>> '-1'.isalnum()
False
>>>
>>> '1.'.isalnum()
False
>>>
>>> '1,'.isalnum()
False
>>>
>>> '1.0'.isalnum()
False
>>>
>>> '1,0'.isalnum()
False
>>>
>>> '1_0'.isalnum()
False
>>>
>>> '10'.isalnum()
True
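For plain ASCII digits the methods .isdecimal(), .isdigit() and .isnumeric() give the same answers, as the examples above show. They differ on other Unicode characters. The sketch below compares them on a superscript two and on the fraction one half; the name `results` is used only for this illustration.

```python
# Compare the three methods on an ASCII digit, a superscript two
# and the fraction one half.
results = {}
for char in ['5', '²', '½']:
    results[char] = (char.isdecimal(), char.isdigit(), char.isnumeric())
    print(repr(char), results[char])

# '5' (True, True, True)
# '²' (False, True, True)
# '½' (False, False, True)
```

In other words, .isdecimal() is the strictest of the three and .isnumeric() the most permissive.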

3.9.14. Find Sub-String Position

Finds position of a letter in text:

>>> text = 'We choose to go to the Moon'
>>> text.find('M')
23

Will find first occurrence:

>>> text = 'We choose to go to the Moon'
>>> text.find('o')
5

Also works on substrings:

>>> text = 'We choose to go to the Moon'
>>> text.find('Moo')
23

Will yield -1 if substring is not found:

>>> text = 'We choose to go to the Moon'
>>> text.find('x')
-1
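Because .find() signals a missing substring with -1 rather than an error, the result should be checked before it is used as an index (-1 is itself a valid index, pointing at the last character). A minimal sketch, using the email address from earlier examples and hypothetical names `username` and `domain`:

```python
email = 'mark.watney@nasa.gov'

position = email.find('@')

# find() returns -1 instead of raising an error, so check before slicing.
if position == -1:
    print('Not an email address')
else:
    username = email[:position]
    domain = email[position + 1:]
    print(username)  # mark.watney
    print(domain)    # nasa.gov
```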

3.9.15. Count Occurrences

>>> text = 'Moon'
>>>
>>>
>>> text.count('o')
2
>>>
>>> text.count('Moo')
1
>>>
>>> text.count('x')
0

3.9.16. Remove Prefix or Suffix

Since Python 3.9: PEP 616 -- String methods to remove prefixes and suffixes

>>> filename = 'myfile.txt'
>>> filename.removeprefix('my')
'file.txt'
>>> filename = 'myfile.txt'
>>> filename.removesuffix('.txt')
'myfile'
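These methods are easy to confuse with the strip family. The sketch below (requires Python 3.9 or newer; the filename is made up) contrasts .removesuffix(), which removes an exact substring, with .rstrip(), which treats its argument as a set of characters.

```python
filename = 'report.txt'

# removesuffix() removes the exact substring, once, if it is present.
exact = filename.removesuffix('.txt')
print(exact)  # report

# rstrip() treats its argument as a set of characters to strip,
# so it also eats the final 't' of 'report'.
stripped = filename.rstrip('.txt')
print(stripped)  # repor

# If the prefix or suffix is absent, the string is returned unchanged.
unchanged = filename.removeprefix('my')
print(unchanged)  # report.txt
```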

3.9.17. Method Chaining

>>> text = 'Python'
>>>
>>> text = text.upper()
>>> text = text.replace('P', 'C')
>>> text = text.title()
>>>
>>> print(text)
Cython

The same steps can be written as a single chain:

>>> text = 'Python'
>>>
>>> text = text.upper().replace('P', 'C').title()
>>>
>>> print(text)
Cython

>>> text = 'Python'
>>>
>>> text.upper().replace('P', 'C').title()
'Cython'

How it works:

  1. text -> 'Python'

  2. 'Python'.upper() -> 'PYTHON'

  3. 'PYTHON'.replace('P', 'C') -> 'CYTHON'

  4. 'CYTHON'.title() -> 'Cython'

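Chaining works only as long as every method in the chain returns a str. The .startswith() method returns a bool, which has no .replace() method:

>>> text = 'Python'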
>>>
>>> text = text.upper().startswith('P').replace('P', 'C')
Traceback (most recent call last):
AttributeError: 'bool' object has no attribute 'replace'

A long chain can be split across lines with the \ character. Note that no character, not even a space, may follow the \ character:

>>> text = 'Python'
>>>
>>> text = text.upper() \
...            .replace('P', 'C') \
...            .title()
>>>
>>> print(text)
Cython

Alternatively, wrap the whole expression in parentheses:

>>> text = 'Python'
>>>
>>> text = (text.upper()
...             .replace('P', 'C')
...             .title())
>>>
>>> print(text)
Cython

3.9.18. Use Case - 0x01

>>> DATA = 'ul. pANA tWARdoWSKiego 3'
>>>
>>> result = (
...     DATA
...
...     # Normalize
...     .upper()
...
...     # Remove whitespace control chars
...     .replace('\n', ' ')
...     .replace('\t', ' ')
...     .replace('\v', ' ')
...     .replace('\f', ' ')
...
...     # Remove whitespaces
...     .replace('    ', ' ')
...     .replace('   ', ' ')
...     .replace('  ', ' ')
...
...     # Remove special characters
...     .replace('$', '')
...     .replace('@', '')
...     .replace('#', '')
...     .replace('^', '')
...     .replace('&', '')
...     .replace('.', '')
...     .replace(',', '')
...     .replace('|', '')
...
...     # Remove prefixes
...     .removeprefix('ULICA')
...     .removeprefix('UL')
...     .removeprefix('OSIEDLE')
...     .removeprefix('OS')
...
...     # Substitute
...     .replace('3', 'III')
...     .replace('2', 'II')
...     .replace('1', 'I')
...
...     # Format output
...     .title()
...     .replace('Iii', 'III')
...     .replace('Ii', 'II')
...     .strip()
... )
>>>
>>> result
'Pana Twardowskiego III'

3.9.19. Assignments

Code 3.24. Solution
"""
* Assignment: String Methods Splitlines
* Required: no
* Complexity: easy
* Lines of code: 1 lines
* Time: 3 min

English:
    1. Split `DATA` by lines
    2. Run doctests - all must succeed

Polish:
    1. Podziel `DATA` po liniach
    2. Uruchom doctesty - wszystkie muszą się powieść

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> assert result is not Ellipsis, \
    'Assign result to variable: `result`'

    >>> assert len(result) == 3, \
    'Variable `result` length should be 3'

    >>> assert type(result) is list, \
    'Variable `result` has invalid type, should be list'

    >>> line = 'We choose to go to the Moon'
    >>> assert line in result, f'Line "{line}" is not in the result'

    >>> line = 'in this decade and do the other things.'
    >>> assert line in result, f'Line "{line}" is not in the result'

    >>> line = 'Not because they are easy, but because they are hard.'
    >>> assert line in result, f'Line "{line}" is not in the result'
"""

DATA = """We choose to go to the Moon
in this decade and do the other things.
Not because they are easy, but because they are hard."""

# list[str]: with DATA split by lines
result = ...

Code 3.25. Solution
"""
* Assignment: String Methods Join
* Required: no
* Complexity: easy
* Lines of code: 1 lines
* Time: 3 min

English:
    1. Join lines of text with newline (`\n`) character
    2. Run doctests - all must succeed

Polish:
    1. Połącz linie tekstu znakiem końca linii (`\n`)
    2. Uruchom doctesty - wszystkie muszą się powieść

Hints:
    * `str.join()`

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> assert result is not Ellipsis, \
    'Assign result to variable: `result`'

    >>> assert type(result) is str, \
    'Variable `result` has invalid type, should be str'

    >>> assert result.count('\\n') == 2, \
    'There should be only two newline characters in result'

    >>> line = 'We choose to go to the Moon'
    >>> assert line in result, f'Line "{line}" is not in the result'

    >>> line = 'in this decade and do the other things.'
    >>> assert line in result, f'Line "{line}" is not in the result'

    >>> line = 'Not because they are easy, but because they are hard.'
    >>> assert line in result, f'Line "{line}" is not in the result'
"""

DATA = ['We choose to go to the Moon',
        'in this decade and do the other things.',
        'Not because they are easy, but because they are hard.']

# str: with lines from DATA joined with newline (`\n`) character
result = ...

Code 3.26. Solution
"""
* Assignment: String Methods Normalize
* Required: yes
* Complexity: easy
* Lines of code: 4 lines
* Time: 8 min

English:
    1. Use `str` methods to clean `DATA`
    2. Run doctests - all must succeed

Polish:
    1. Wykorzystaj metody `str` do oczyszczenia `DATA`
    2. Uruchom doctesty - wszystkie muszą się powieść

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> assert result is not Ellipsis, \
    'Assign result to variable: `result`'
    >>> assert type(result) is str, \
    'Variable `result` has invalid type, should be str'

    >>> result
    'Pana Twardowskiego III'
"""

DATA = 'UL. pana \tTWArdoWskIEGO 3'

# str: Pana Twardowskiego III
result = ...