Introduction#
Previously in part 1, we covered the basics of Python’s built-in data types that are scalar in nature, meaning they represent a single value. Yet, we commonly have to work with many objects and values at once. In part 2, we will explore the sequence and set types that can hold multiple Python objects and values.
| Type | Family | Value | Mutability | Truthiness |
|---|---|---|---|---|
NoneType | None | None | Immutable | Falsy |
bool | Numeric | True, False | Immutable | Truthy, Falsy |
int | Numeric | ℤ = {…, -1, 0, 1, …} | Immutable | 0 → Falsy, Non-zero → Truthy |
float | Numeric | ≈ ℝ | Immutable | 0.0 → Falsy, Non-zero → Truthy |
complex | Numeric | ≈ ℂ | Immutable | 0+0j → Falsy, Non-zero → Truthy |
str | Sequence | … | … | … |
bytes | Sequence | … | … | … |
tuple | Sequence | … | … | … |
list | Sequence | … | … | … |
bytearray | Sequence | … | … | … |
set | Set | … | … | … |
frozenset | Set | … | … | … |
dict | Mapping | … | … | … |
We will start with Python strings, which are used to represent text, and are usually introduced as a scalar type. We intentionally left them out of part 1 because they can also be thought of as a sequence of values that represent characters, which is exactly how Python defines them.
String#
Python sequences represent finite and ordered values. Strings represent text as a sequence of characters. But unlike many other programming languages, Python does not have a separate character type. Instead, Python represents each character as a string of length one.
Under the hood, each character is just a number. You can get the number for a character using the built-in ord function, and you can get the character for a number using the built-in chr function.
>>> channel = "Python In Production"
>>> type(channel)
<class 'str'>
>>> my_char = "A"
>>> type(my_char)
<class 'str'>
>>> len(my_char)
1
>>> ord(my_char)
65
>>> chr(65)
'A'This number is determined by the Unicode standard, which assigns every character a unique number called a Unicode code point, usually shown in hexadecimal notation.
Unicode has space for over a million possible code points, and roughly 300,000 are currently assigned. It covers letters from writing systems around the world, numbers, punctuation, accent marks, mathematical symbols, currency symbols, technical symbols, historical scripts, emoji, control characters like newline and tab, and private-use slots for custom meanings.
>>> japanese_sun = "日"
>>> hex(ord(japanese_sun))
'0x65e5'
>>> number_as_string = "1"
>>> hex(ord(number_as_string))
'0x31'
>>> hex(ord("¡"))
'0xa1'
>>> hex(ord("é"))
'0xe9'
>>> hex(ord("∑"))
'0x2211'
>>> hex(ord("€"))
'0x20ac'
>>> hex(ord("⌘"))
'0x2318'
>>> hex(ord("𓀀"))
'0x13000'
>>> hex(ord("🐍"))
'0x1f40d'
>>> hex(ord("\n"))
'0xa'
>>> hex(ord("\t"))
'0x9'
>>> hex(ord("")) # renders on Apple devices
'0xf8ff'Since strings are sequences of characters, indexing and slicing them gives you back a string. This is also why you can iterate over a string and get back one-character strings.
>>> channel = "Python In Production"
>>> channel[0]
'P'
>>> type(channel[0])
<class 'str'>
>>> channel[:6]
'Python'
>>> for char in channel:
... print(char.upper(), end=" ")
P Y T H O N I N P R O D U C T I O NStrings are immutable, so any string operation creates a new string object in memory.
>>> playlist = channel.replace("In", "Before")The empty string "" is the only string of length zero, and also the only falsy string.
>>> len("")
0
>>> bool("")
FalseBytes#
Strings are closely related to another immutable sequence type called bytes, which represents sequences of bytes, or 8-bit integers. They are often used when working with binary data, like files, network messages, or encoded text.
You can create a bytes object from a string using the encode() method. We will not go into the details of the different encoding methods, but the default Python uses is UTF-8, a scheme for mapping Unicode code points to bytes.
You can also write a bytes object directly as a literal using a string with a leading b.
Note that a bytes literal can only contain ASCII characters in the source, which are the plain English letters, digits, and common punctuation. For anything else, such as accented letters, emoji, or non-Latin scripts, you must use the encode() method or the bytes() constructor.
The length of a bytes object is the number of bytes, which may differ from the number of characters in the original string if it contains non-ASCII characters. For ASCII text, all three approaches to creating a bytes object produce equal results.
>>> data = "Python In Production".encode(encoding="utf-8")
>>> type(data)
<class 'bytes'>
>>> data_from_b_literal = b"Python In Production"
>>> cannot_use_non_ascii = b"café"
File "<stdin>", line 1
cannot_use_non_ascii = b"café"
^^^^^^^
SyntaxError: bytes can only contain ASCII literal characters
>>> non_ascii_data = "café".encode(encoding="utf-8")
>>> non_ascii_data
b'caf\xc3\xa9'
>>> len("café")
4
>>> len(non_ascii_data)
5
>>> data_from_constructor = bytes("Python In Production", encoding="utf-8")
>>> data == data_from_b_literal == data_from_constructor
TrueWe saw that indexing or slicing a str object gives you back a str object. This is not the case for bytes. Indexing a bytes object gives you back an int representing the byte value at that position, while slicing a bytes object gives you back another bytes object.
>>> data = b"Hello, World!"
>>> data[0]
72
>>> type(data[0])
<class 'int'>
>>> data[:5]
b'Hello'
>>> type(data[:5])
<class 'bytes'>
>>> for byte in data:
... print(byte, end=" ")
72 101 108 108 111 44 32 87 111 114 108 100 33The empty bytes object b"" is the only bytes object of length zero, and also the only falsy bytes object.
| Type | Family | Value | Mutability | Truthiness |
|---|---|---|---|---|
str | Sequence | sequences of Unicode code points | Immutable | "" → Falsy, Non-empty → Truthy |
bytes | Sequence | sequences of bytes | Immutable | b"" → Falsy, Non-empty → Truthy |
Tuple#
Strings are immutable sequences of characters, which are all of the same string type. Bytes are immutable sequences of bytes, which are all of the same integer type. What if we want to have an immutable sequence of Python objects that can be of other types or even a mix of types?
Python tuples can do that. They are often used to represent fixed collections of objects that belong together, like the coordinates of a point or the RGB values of a pixel. Because they are sequences, like strings and bytes, they also support indexing and slicing, and you can iterate over them.
>>> point = (1, 2)
>>> type(point)
<class 'tuple'>
>>> color = (255, 0, 0)
>>> mixed_and_nested_tuple = (None, False, True, (2, 3.0), "four")
>>> for value in mixed_and_nested_tuple:
... print(type(value))
<class 'NoneType'>
<class 'bool'>
<class 'bool'>
<class 'tuple'>
<class 'str'>You can create an empty tuple with empty parentheses or using the tuple() constructor. A small gotcha for creating a tuple with one element is that you need to include a trailing comma; otherwise, it will be interpreted as the value itself. In fact, the parentheses are optional for creating non-empty tuples, and the comma is what actually creates the tuple.
>>> empty_tuple = ()
>>> also_empty_tuple = tuple()
>>> not_a_tuple = (1)
>>> type(not_a_tuple)
<class 'int'>
>>> tuple_with_one_element = (1,)
>>> type(tuple_with_one_element)
<class 'tuple'>
>>> another_tuple_with_one_element = 1,
>>> type(another_tuple_with_one_element)
<class 'tuple'>Tuples are immutable, so you cannot change their contents after they are created. You cannot add, remove, or modify elements in a tuple.
>>> point = (1, 2)
>>> point[0] = 3
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
point[0] = 3
~~~~~^^^
TypeError: 'tuple' object does not support item assignmentLike other sequences, the empty tuple is the only tuple of length zero, and also the only falsy tuple.
| Type | Family | Value | Mutability | Truthiness |
|---|---|---|---|---|
str | Sequence | sequences of Unicode code points | Immutable | "" → Falsy, Non-empty → Truthy |
bytes | Sequence | sequences of bytes | Immutable | b"" → Falsy, Non-empty → Truthy |
tuple | Sequence | sequences of Python objects | Immutable | () → Falsy, Non-empty → Truthy |
List#
So far, we only looked at immutable data types. Our first mutable sequence is the Python list. Lists can hold any type of Python objects. So, they are like tuples, but with the added ability to change their contents after creation without creating a new object in memory. You can use mutable list methods like append() and insert() to modify them in place. They are often used to represent collections of items that are dynamically changing. They also support indexing, slicing, and iteration just like other sequences.
>>> my_list = [True, "Two", 3]
>>> my_list.append(4.0)
>>> my_list.insert(0, False)
>>> my_list
[False, True, 'Two', 3, 4.0]
>>> my_list[0] = 1
>>> my_list[4] = 5
>>> for item in my_list:
... print(item, end=" ")
1 True Two 3 5A common source of confusion is that the mutability or immutability of an object is a shallow property, meaning it only applies to the collection itself, not to the objects it contains. That’s why you can have a list that contains immutable objects, like we just saw, or a tuple that contains mutable objects like a list.
If you have a tuple that contains a list, you can modify the list inside the tuple, since a list itself is mutable. But you cannot change the tuple itself to refer to a different list or any other object.
>>> my_tuple = (1, 2, [3, 4])
>>> my_tuple[2].append(5)
>>> my_tuple
(1, 2, [3, 4, 5])
>>> my_tuple[2] = [6, 7]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
my_tuple[2] = [6, 7]
~~~~~~~~^^^
TypeError: 'tuple' object does not support item assignmentBytearray#
Loosely speaking, a list is a mutable version of a tuple, whereas bytearray is a mutable version of bytes. You can create a bytearray from a bytes object, from a string with an encoding, or from an iterable of integers from 0 to 255, all using the bytearray() constructor. Unlike bytes, bytearray has no literal syntax.
>>> immutable_data = b"Hello, World!"
>>> question_mark = ord("?")
>>> immutable_data[-1] = question_mark
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
immutable_data[-1] = question_mark
~~~~~~~~~~~~~~^^^^
TypeError: 'bytes' object does not support item assignment
>>> mutable_data = bytearray(immutable_data)
>>> type(mutable_data)
<class 'bytearray'>
>>> mutable_data[-1] = question_mark
>>> mutable_data
bytearray(b'Hello, World?')
>>> mutable_data.append(question_mark)
>>> mutable_data
bytearray(b'Hello, World??')That covers both immutable and mutable sequence types. The set types come next.
| Type | Family | Value | Mutability | Truthiness |
|---|---|---|---|---|
str | Sequence | sequences of Unicode code points | Immutable | "" → Falsy, Non-empty → Truthy |
bytes | Sequence | sequences of bytes | Immutable | b"" → Falsy, Non-empty → Truthy |
tuple | Sequence | sequences of Python objects | Immutable | () → Falsy, Non-empty → Truthy |
list | Sequence | sequences of Python objects | Mutable | [] → Falsy, Non-empty → Truthy |
bytearray | Sequence | sequences of bytes | Mutable | bytearray() → Falsy, Non-empty → Truthy |
Set#
Python sets represent unordered collections of unique values. So, unlike sequences, they cannot be indexed or sliced. However, you can still iterate over them and use the built-in len() function to get the number of elements in a set.
Because there’s no beginning or end to a set, we don’t append or insert elements at a specific position. Instead, we add elements to a set using the add() method and remove elements using the remove() method.
>>> my_set = {1, 2, 3}
>>> for item in my_set:
... print(item, end=" ")
1 2 3
>>> len(my_set)
3
>>> my_set.add(4)
>>> my_set
{1, 2, 3, 4}
>>> my_set.remove(3)
>>> my_set
{1, 2, 4}Sets are often used for fast membership testing, like we saw in part 1, and for removing duplicates from a sequence. They are also useful for performing mathematical set operations like union, intersection, difference, and symmetric difference.
>>> my_list = [1, 1, 2, 3, 5, 8]
>>> unique_values = set(my_list)
>>> unique_values
{1, 2, 3, 5, 8}Frozenset#
Set types also include frozenset, which is an immutable version of a set that you cannot modify after creation. You can create a frozenset from any iterable using the frozenset() constructor.
>>> my_frozenset = frozenset([1, 2, 3])
>>> type(my_frozenset)
<class 'frozenset'>
>>> my_frozenset.add(4)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
my_frozenset.add(4)
^^^^^^^^^^^^^^^^
AttributeError: 'frozenset' object has no attribute 'add'Hashing#
We mentioned in part 1 that sets are dramatically faster than lists for membership testing. The reason is that sets are implemented as hash tables, which allow for constant average time complexity for membership testing.
When you check if an element is in a set, Python computes the hash of the element and uses it to quickly find the corresponding bucket in the hash table where the element would be stored. If the bucket is empty, the element is not in the set. If the bucket contains one or more elements, Python compares the element with each of them to see if it matches.
Python integers generally have a hash value equal to their value. There are exceptions, though. For example, the hash of -1 is set to -2, to avoid the special return value of -1 that the hash function uses to indicate an error. As a side effect, -1 and -2 are a pair of integers that share the same hash. So, they will always collide in a hash table if both are present, but this is not a problem since they are different values and will be compared for equality when they end up in the same bucket.
>>> hash(42)
42
>>> hash(2026)
2026
>>> hash(-1)
-2
>>> hash(-2)
-2Hashability has two requirements: a hash value that does not change during the object’s lifetime, and the ability to be compared to other objects for equality. This is why mutable types like lists and sets are not hashable, while immutable types like tuples and frozensets are hashable. So, you can have a tuple or list that contains a list, but you cannot have a set or a frozenset that contains a list.
>>> my_tuple = (1, 2, [3, 4])
>>> my_set = {1, 2, [3, 4]}
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
my_set = {1, 2, [3, 4]}
^^^^^^^^^^^^^^
TypeError: cannot use 'list' as a set element (unhashable type: 'list')Let’s add one more column to our data types table for hashability even though it can be inferred from the mutability here.
| Type | Family | Value | Mutability | Truthiness | Hashable |
|---|---|---|---|---|---|
str | Sequence | sequences of Unicode code points | Immutable | "" → Falsy, Non-empty → Truthy | Hashable |
bytes | Sequence | sequences of bytes | Immutable | b"" → Falsy, Non-empty → Truthy | Hashable |
tuple | Sequence | sequences of Python objects | Immutable | () → Falsy, Non-empty → Truthy | Hashable |
list | Sequence | sequences of Python objects | Mutable | [] → Falsy, Non-empty → Truthy | Not Hashable |
bytearray | Sequence | sequences of bytes | Mutable | bytearray() → Falsy, Non-empty → Truthy | Not Hashable |
set | Set | unordered collections of unique values | Mutable | set() → Falsy, Non-empty → Truthy | Not Hashable |
frozenset | Set | unordered collections of unique values | Immutable | frozenset() → Falsy, Non-empty → Truthy | Hashable |
Performance#
If we revisit the membership testing benchmarks from part 1, something like the following happens at an intuitive level. When we check whether 500_000 is in a set of a million integers, Python computes the hash of 500_000, which is itself, and then takes it modulo some k to find the corresponding bucket in the hash table. If k were equal to 250_000, for instance, then 0, 250_000, 500_000, and 750_000 would all map to the same bucket.
Python then checks the entries in that bucket one by one and finds 500_000. On top of a quick hash lookup, we scan at most four elements, compared to up to 500,000 in a list.
>>> my_set = set(range(1_000_000))
>>> 500_000 in my_set
TrueBut lookup is only half the story. Building a set isn’t free, since every element gets hashed on the way in, which takes longer than building a list. Let’s have our benchmarks also report the build time and compare.
PRESENT
n lookup list lookup set set quicker build list build set
10 34 ns 9 ns ~4x 37 ns 96 ns
100 257 ns 9 ns ~30x 120 ns 798 ns
1,000 2.30 us 15 ns ~150x 938 ns 6.0 us
10,000 23.40 us 14 ns ~1,500x 15.2 us 64.9 us
100,000 247.00 us 14 ns ~15,000x 178.1 us 677.2 us
1,000,000 2.28 ms 14 ns ~150,000x 2.04 ms 7.45 ms
10,000,000 23.88 ms 14 ns ~1,500,000x 22.57 ms 73.84 msMISSING
n lookup list lookup set set quicker build list build set
10 59 ns 8 ns ~7x 34 ns 97 ns
100 521 ns 8 ns ~60x 136 ns 794 ns
1,000 5.00 us 8 ns ~600x 942 ns 6.0 us
10,000 50.20 us 8 ns ~6,000x 16.2 us 65.8 us
100,000 507.50 us 8 ns ~60,000x 183.1 us 686.4 us
1,000,000 5.08 ms 8 ns ~600,000x 2.03 ms 7.44 ms
10,000,000 51.60 ms 8 ns ~6,000,000x 22.49 ms 74.12 msBuilding a set costs about three to four times what building a list does, and roughly the same as one full list scan against a missing target. Both touch every element. The set adds a hash on the way in.
So if we have a list and we only need to check membership once or twice, converting to a set is roughly a wash. But the moment we do it more than a few times, the set starts winning.
Where exactly is the break-even, the number of lookups at which converting to a set pays off?
PRESENT
n lookup list lookup set set quicker build list build set breakeven
10 34 ns 9 ns ~4x 37 ns 96 ns 3.9
100 257 ns 9 ns ~30x 120 ns 798 ns 3.2
1,000 2.30 us 15 ns ~150x 938 ns 6.0 us 2.6
10,000 23.40 us 14 ns ~1,500x 15.2 us 64.9 us 2.8
100,000 247.00 us 14 ns ~15,000x 178.1 us 677.2 us 2.7
1,000,000 2.28 ms 14 ns ~150,000x 2.04 ms 7.45 ms 3.3
10,000,000 23.88 ms 14 ns ~1,500,000x 22.57 ms 73.84 ms 3.1MISSING
n lookup list lookup set set quicker build list build set breakeven
10 59 ns 8 ns ~7x 34 ns 97 ns 1.9
100 521 ns 8 ns ~60x 136 ns 794 ns 1.5
1,000 5.00 us 8 ns ~600x 942 ns 6.0 us 1.2
10,000 50.20 us 8 ns ~6,000x 16.2 us 65.8 us 1.3
100,000 507.50 us 8 ns ~60,000x 183.1 us 686.4 us 1.4
1,000,000 5.08 ms 8 ns ~600,000x 2.03 ms 7.44 ms 1.5
10,000,000 51.60 ms 8 ns ~6,000,000x 22.49 ms 74.12 ms 1.4In the worst case, the break-even sits at one to two lookups. In the average case, it climbs to about three. Either way, it’s roughly the same regardless of how big the collection is.
So should we always convert our lists to sets after a couple of lookups?
No, definitely not. If we wanted to extract every nanosecond, we would not be writing Python, and the Zen of Python says:
Beautiful is better than ugly. Simple is better than complex. Readability counts.
We can still convert a list to a set when we have a real reason. Or, if performance is really a bottleneck where Python can’t help, we can always rewrite that part in C or Rust, and call it from Python. This is why numerical computing libraries like NumPy are written in C and provide Python bindings. They can perform heavy computations much faster than pure Python code while still being relatively easy to use from Python.
But never forget that premature optimization is the root of all evil.
Conclusion#
In the first two parts, we covered the scalar, sequence, and set data types. In the next and final part of our data types chapter, we will cover the mapping types: dictionaries and frozen dictionaries. Dictionaries are arguably the most important data structure in Python, since the language itself is built on top of dictionary-like structures, and they are used heavily throughout most Python code. See you there!
