Introduction
According to the encapsulation principle of object-oriented programming, the internal implementation details of a class should not matter to the class user. This is a good idea because it lets the user focus on the semantics and application of the class's public interface. Still, knowing how a class is implemented internally helps us write more efficient and scalable programs.
👉 What is a Python List
Since I expect that you're not a beginner-level programmer, I'll skip the basics.
Aside from the basic and popular descriptions of a Python List object (zero-indexed, mutable, etc.), the closest description I can give of a Python List object is a referential dynamic array.
Note: Python Lists are not arrays; "dynamic array" is only a good description.
- What is an array?
An array is a data structure (DS) that stores and manages a fixed number of elements. Since every DS stores and manages a collection of elements, what differentiates one DS from another is mainly:
- How it stores elements in memory.
- Its interfaces and their internal implementation.
- What is a dynamic array?
A dynamic array is an array that is resizable, i.e., it allows its size to increase. For instance, a dynamic array containing the names of students may grow from holding 5 names to holding 10; typical arrays do not have an interface for this. For a Python List, the increase in size occurs when elements are added to the list using the .append() or .extend() methods.
- If arrays have fixed sizes, how then is a dynamic array implemented?
Simply put, dynamic arrays do not change the fixed-size characteristic of arrays. At initialization, an array (with a fixed size, say A) is created. The array is usually created such that it can hold more elements than it holds at initialization. This way, if you declare a list with five student names, that list may already have the capacity to hold more than five names.
However, as more elements are added, the array eventually reaches its maximum capacity. At this point, any further attempt to add an element leads the class to request a new, larger array from the system. This request is carried out internally by a .resize() method (in CPython, the C function list_resize plays this role). The elements of the old array A are copied into the new array (say B) in the same order, so that B's prefix matches A. Note that the new array also has a fixed, but larger, size. The operation repeats when array B's capacity is exhausted and an attempt is made to add another element. This is how dynamic arrays work.
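The resizing behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the DynamicArray class, its _resize helper, and the doubling growth factor are hypothetical choices made for clarity, and a plain Python list of fixed length stands in for the raw fixed-size array. CPython's real list is written in C and grows more conservatively.

```python
class DynamicArray:
    """Illustrative dynamic array (hypothetical; not CPython's implementation)."""

    def __init__(self):
        self._length = 0                         # number of elements stored
        self.capacity = 1                        # size of the fixed backing array
        self._array = [None] * self.capacity     # stand-in for a fixed-size array

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        if not 0 <= index < self._length:
            raise IndexError("index out of range")
        return self._array[index]

    def append(self, value):
        if self._length == self.capacity:        # array A is full
            self._resize(2 * self.capacity)      # assumption: double the capacity
        self._array[self._length] = value
        self._length += 1

    def _resize(self, new_capacity):
        new_array = [None] * new_capacity        # the new, larger array B
        for i in range(self._length):            # copy A into the prefix of B
            new_array[i] = self._array[i]
        self._array = new_array
        self.capacity = new_capacity


names = DynamicArray()
for name in ["Bola", "James", "Ada", "Tunde", "Chi"]:
    names.append(name)
print(len(names), names.capacity)                # 5 8
```

Doubling keeps the number of copies small relative to the number of appends, which is why appending stays cheap on average even though an individual resize is expensive.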
Check out the experiment below that shows how the size of a list grows as new elements are added to the list and its capacity gets exhausted.
#code
import sys

data = []
for i in range(30):
    length = len(data)            # number of elements currently stored
    size = sys.getsizeof(data)    # size of the list object itself, in bytes
    print('List address: {0}; Length: {1:3d}; Size in bytes: {2:4d};'.format(id(data), length, size))
    data.append(None)
#result
List address: 140479401430208; Length: 0; Size in bytes: 56;
List address: 140479401430208; Length: 1; Size in bytes: 88;
List address: 140479401430208; Length: 2; Size in bytes: 88;
List address: 140479401430208; Length: 3; Size in bytes: 88;
List address: 140479401430208; Length: 4; Size in bytes: 88;
List address: 140479401430208; Length: 5; Size in bytes: 120;
List address: 140479401430208; Length: 6; Size in bytes: 120;
List address: 140479401430208; Length: 7; Size in bytes: 120;
List address: 140479401430208; Length: 8; Size in bytes: 120;
List address: 140479401430208; Length: 9; Size in bytes: 184;
List address: 140479401430208; Length: 10; Size in bytes: 184;
List address: 140479401430208; Length: 11; Size in bytes: 184;
List address: 140479401430208; Length: 12; Size in bytes: 184;
List address: 140479401430208; Length: 13; Size in bytes: 184;
List address: 140479401430208; Length: 14; Size in bytes: 184;
List address: 140479401430208; Length: 15; Size in bytes: 184;
List address: 140479401430208; Length: 16; Size in bytes: 184;
List address: 140479401430208; Length: 17; Size in bytes: 248;
List address: 140479401430208; Length: 18; Size in bytes: 248;
List address: 140479401430208; Length: 19; Size in bytes: 248;
List address: 140479401430208; Length: 20; Size in bytes: 248;
List address: 140479401430208; Length: 21; Size in bytes: 248;
List address: 140479401430208; Length: 22; Size in bytes: 248;
List address: 140479401430208; Length: 23; Size in bytes: 248;
List address: 140479401430208; Length: 24; Size in bytes: 248;
List address: 140479401430208; Length: 25; Size in bytes: 312;
List address: 140479401430208; Length: 26; Size in bytes: 312;
List address: 140479401430208; Length: 27; Size in bytes: 312;
List address: 140479401430208; Length: 28; Size in bytes: 312;
List address: 140479401430208; Length: 29; Size in bytes: 312;
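The byte sizes above can be turned into an estimate of the list's capacity. The sketch below assumes that everything beyond the size of an empty list is space for references, each one pointer wide; the growth_points function is a hypothetical helper, and the numbers it prints depend on the Python version and platform.

```python
import struct
import sys


def growth_points(n):
    """Return (length, estimated_capacity) pairs, one per observed size change."""
    pointer_size = struct.calcsize("P")    # bytes per reference on this platform
    empty_size = sys.getsizeof([])         # fixed overhead of a list object
    data = []
    points = []
    last_size = empty_size
    for _ in range(n):
        data.append(None)
        size = sys.getsizeof(data)
        if size != last_size:              # the allocated capacity changed
            capacity = (size - empty_size) // pointer_size
            points.append((len(data), capacity))
            last_size = size
    return points


for length, capacity in growth_points(30):
    print(f"resized at length {length:2d}; estimated capacity {capacity:2d}")
```

Applied to the output shown above with 8-byte references, the same arithmetic gives capacities of 4, 8, 16, 24 and 32: only five resizes for thirty appends.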
👉 How does Python store List objects and their elements?
From elementary CS, we know that, at the most basic level, computers store information using bits (0s and 1s). These bits are grouped into larger units known as bytes. Each byte holds 8 bits of information, and every computer has a large number of bytes for storing information.
01010101 contains one byte of information
To keep track of where each piece of information is stored in memory, computers use an abstraction known as a memory address. Without memory addresses, computers would have to access information sequentially; it would be like having to open pages 1, 2, 3, ..., 9 just to get to page 10. With memory addresses, computers can access information stored at any memory location directly; it is like going straight to page 10. Following this abstraction, we can say that retrieving information stored at any location is an O(1) run-time operation. This is why we refer to the main memory of a computer as Random Access Memory. The image below is a representation of memory locations with their addresses at the top.
The use of memory addresses makes it just as easy to access the information stored at cell A as it is to access the information stored at cell L.
- Why I described Python List objects as "referential"
Computers store related elements (e.g., the characters of a string or the elements of an array) in contiguous memory locations for easy access and clean organization. Also, each cell of memory stores the same number of bytes of information. This contributes to why accessing the elements of a list and the characters of a string via indexing is an O(1) operation. I explain why in the following paragraphs.
For this illustration, assume each Unicode character is stored using 2 bytes (CPython actually uses 1, 2, or 4 bytes per character, depending on the widest character in the string). Consequently, a string containing 5 characters requires 10 bytes of memory. Such a string, say "Lemon", may be stored using the representation below. Each cell is assumed to hold a byte of information.
This method of storing strings follows the same abstraction as that of storing an array. Since each character is stored using the same number of bytes, and therefore the same number of cells, we can reach the character at a given index by offsetting by the number of bytes that precede it. For example, to access the character at index 4 starting from index 0, we offset by 4 × 2 = 8 bytes.
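The offset arithmetic can be tried out on real bytes. As a sketch, the snippet below encodes "Lemon" as UTF-16 (little-endian, no byte-order mark), which uses exactly 2 bytes for each of these characters; BYTES_PER_CHAR and char_at are hypothetical names, and this is not how Python stores strings internally.

```python
BYTES_PER_CHAR = 2                       # assumption: fixed width, 2 bytes each

text = "Lemon"
raw = text.encode("utf-16-le")           # 2 bytes per character for this text


def char_at(buffer, index):
    """Fetch one character by jumping straight to its byte offset."""
    offset = index * BYTES_PER_CHAR      # e.g. index 4 -> offset 8
    return buffer[offset:offset + BYTES_PER_CHAR].decode("utf-16-le")


print(len(raw))          # 10
print(char_at(raw, 4))   # n
```

No matter which index is requested, the lookup is one multiplication and one slice, which is the O(1) behaviour the fixed-width layout buys us.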
Python lists may hold objects of different types and lengths. A Python List may be declared as shown below.
sample_list = [ 'Bola', 'James', 23, 20645, ['Another', 'list'] ]
Preserving the ability to access the element at a particular index by offsetting by a calculated number of bytes would mean giving every element in the list the amount of space occupied by the largest element. This is inefficient for two reasons:
- It would waste memory.
- If an element larger than the largest element in the list is added via the .append() or .insert() method, the abstraction breaks because we would have to reassign a new number of bytes to every element in the list. The thought of this already seems messy.
To fix this problem, a Python List object does not store its elements directly in the contiguous memory locations where the List itself is stored. Instead, it stores the memory addresses of the locations where its elements are stored. This way, the O(1) run-time of accessing a list element is preserved, since every memory address is stored using the same number of bytes. Because a Python List stores references to its elements' locations rather than the elements themselves, we say that Python Lists are referential arrays.
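Two small checks make the referential behaviour visible. This is a sketch that assumes CPython, where sys.getsizeof reports the size of the list object itself and not of the objects it refers to; the variable names are only for illustration.

```python
import sys

short_name = "Bola"
long_text = "x" * 10_000                 # a much larger object

small_items = [short_name, short_name, short_name]
large_items = [long_text, long_text, long_text]

# Both lists hold three references, so the list objects are the same size.
print(sys.getsizeof(small_items) == sys.getsizeof(large_items))   # True on CPython

inner = ["Another", "list"]
sample_list = ["Bola", "James", 23, 20645, inner]

# The last slot refers to the very same object as `inner`, not to a copy.
print(sample_list[4] is inner)           # True
inner.append("grows")
print(sample_list[4])                    # ['Another', 'list', 'grows']
```

The second half also shows a practical consequence: changing a mutable object that a list refers to is visible through the list, because the list only holds its address.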
👉 Some tips on using Python Lists
- Use tuples
If the sequence of objects being stored is not expected to change (e.g., the coordinates of a location), you should consider using a tuple. Because they are immutable, tuples offer some slight advantages over lists: they are more memory efficient, and they can be marginally faster in cases where you have to look up values in a sequence.
The experiment below shows that tuples are more memory efficient than lists.
#code
import sys

temp_tuple = (1, 2, 3, 4)
temp_list = [1, 2, 3, 4]
tuple_size = sys.getsizeof(temp_tuple)
list_size = sys.getsizeof(temp_list)
print("Tuple size is %d bytes" % tuple_size)
print("List size is %d bytes" % list_size)
#result
Tuple size is 72 bytes
List size is 120 bytes
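Immutability is also a safety net for data that should not change. As a brief sketch, with made-up coordinates and a hypothetical city_by_location lookup, the snippet below shows that a tuple rejects item assignment and, because its contents are hashable, can serve as a dictionary key.

```python
location = (6.5244, 3.3792)              # hypothetical (latitude, longitude) pair

try:
    location[0] = 0.0                    # tuples do not support item assignment
except TypeError as error:
    print("TypeError:", error)

# A tuple of hashable values is itself hashable, so it can be a dict key.
city_by_location = {location: "Lagos"}
print(city_by_location[(6.5244, 3.3792)])   # Lagos
```

A list in the same position would raise a TypeError when used as a key, since lists are mutable and therefore unhashable.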
- Use Sets
If the sequence is intended to store only unique values (e.g., a sequence storing users with a particular privilege, since a user should not appear twice in the sequence), consider using a set.
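For the privileged-users case, a minimal sketch might look like this; the admins set and the user names are made up for illustration.

```python
admins = {"bola", "james"}        # hypothetical user names

admins.add("bola")                # adding an existing member changes nothing
admins.add("ada")

print(len(admins))                # 3
print("james" in admins)          # True
print(sorted(admins))             # ['ada', 'bola', 'james']
```

The set enforces uniqueness on its own, so no "is this user already here?" check is needed before adding.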
- Use list comprehensions.
Instead of appending an element to a list in a loop whenever the element meets a particular criterion, as shown below, use a list comprehension.
#code
temp = []
for i in range(10):
    if i % 2 == 0:
        temp.append(i)
print(temp)
#result
[0, 2, 4, 6, 8]
#using a list comprehension
temp_2 = [i for i in range(10) if i % 2 == 0]
print(temp_2)
#result
[0, 2, 4, 6, 8]
The first method is typically slower: the interpreter has to execute the loop body statement by statement and perform an additional attribute lookup and method call at every iteration (since append is a list method), whereas in CPython a list comprehension appends through a dedicated bytecode instruction. Most importantly, using a list comprehension makes our code more readable.
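If you want to check the speed difference on your own machine, a rough measurement with the standard timeit module could look like the sketch below. The function names and the input size are arbitrary choices, and the timings will vary between machines and Python versions, so no expected output is shown.

```python
import timeit


def with_loop(n):
    """Build the list of even numbers below n with an explicit loop."""
    result = []
    for i in range(n):
        if i % 2 == 0:
            result.append(i)
    return result


def with_comprehension(n):
    """Build the same list with a list comprehension."""
    return [i for i in range(n) if i % 2 == 0]


loop_time = timeit.timeit(lambda: with_loop(10_000), number=200)
comp_time = timeit.timeit(lambda: with_comprehension(10_000), number=200)
print(f"loop: {loop_time:.3f}s  comprehension: {comp_time:.3f}s")
```

Both functions return the same list, so any difference in the printed times comes from how the list is built rather than from what it contains.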