As Python is one of the top prerequisites for learning Machine Learning, there is a huge influx of programmers and scientists from different backgrounds.
Writing code in Python is easy: a few lines of code and it’s working! However, Python is slower than many other languages. There are multiple ways to optimize Python code so that loops run faster and less memory is used.
Code profiling (both memory and time)
Cython
Object-oriented programming with design patterns
Generators
Context managers
Multiprocessing
Using __slots__
Using Pythonic code
Preloading memory-intensive operations
Dead code removal
Modularization
Disabling unnecessary print statements
Code profiling
To optimize a program we first need to find its bottlenecks. There are many libraries available to profile code; some that we used are cProfile, line_profiler, and pycallgraph. In addition, we used the %%timeit magic in Jupyter notebooks and memory_profiler. memory_profiler is used to keep RAM usage under a limit.
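As a minimal sketch, cProfile from the standard library can be pointed at a single function like this (slow_sum is just an illustrative workload):

import cProfile
import pstats

def slow_sum(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(1_000_000)
profiler.disable()

# Sort by cumulative time and show the five most expensive entries
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)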
Cython
Cythonizing Python files can make them run faster, since the code is compiled to C. [Reference]
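A minimal sketch using Cython’s pure-Python mode (assuming Cython is installed; the module and function names are illustrative). The file runs as plain Python, and compiling it, e.g. with cythonize -i, turns the annotated loop into typed C code:

import cython

def integrate_squares(a: cython.double, b: cython.double, n: cython.int) -> cython.double:
    # Type annotations let Cython generate typed C code for this loop
    dx: cython.double = (b - a) / n
    total: cython.double = 0.0
    i: cython.int
    for i in range(n):
        total += (a + i * dx) * (a + i * dx) * dx
    return total

print(integrate_squares(0.0, 1.0, 1_000_000))  # approximates the integral of x^2 over [0, 1]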
Object-oriented programming with design patterns
Though Python can be used as a functional programming language, production-level code should follow design patterns for maintainability and future enhancements. [Reference][Reference]
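As one small illustration, the singleton pattern ensures that a class has exactly one shared instance (ModelRegistry is a hypothetical name):

class ModelRegistry:
    """Singleton: every instantiation returns the same shared instance."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.models = {}
        return cls._instance

a = ModelRegistry()
b = ModelRegistry()
assert a is b  # both names point to the same object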
Dead code removal
Dead code, written for some operation but no longer in use, increases debugging and profiling time. Such code should be removed first.
Generators
Python generators are one of the best tools for processing large datasets with limited memory usage at runtime. [Reference]
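A minimal sketch: the generator yields one record at a time, so the full file never sits in memory (data.csv is a hypothetical path):

def read_records(path):
    """Yield one line at a time instead of loading the whole file into memory."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

# Only one line is held in memory at any point during this loop
total_chars = sum(len(line) for line in read_records("data.csv"))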
Context managers
Context managers allow you to allocate and release resources precisely when you want to. Opening files and closing them automatically can be done with their help; there is no need to close files explicitly.
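A minimal sketch showing both the built-in file context manager and a custom one built with contextlib (data.csv is a hypothetical path):

from contextlib import contextmanager
import time

# The file is closed automatically when the with-block exits,
# even if an exception is raised inside it.
with open("data.csv") as f:
    header = f.readline()

@contextmanager
def timer(label):
    """Acquire a 'resource' (a start time) on entry and release it on exit."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label} took {time.perf_counter() - start:.4f} s")

with timer("squares"):
    total = sum(i * i for i in range(100_000))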
Multiprocessing
The multiprocessing module in Python can be used to run independent processes in parallel, making full use of the server’s CPU cores.
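A minimal sketch with multiprocessing.Pool; the worker function and pool size are illustrative:

from multiprocessing import Pool

def square(n):
    return n * n

if __name__ == "__main__":  # required guard so worker processes don't re-import endlessly
    with Pool(processes=4) as pool:  # four worker processes; match your core count
        results = pool.map(square, range(10))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]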
Using __slots__ [Reference]
Object attributes of a class are stored in a per-instance dict. Dictionaries are used for their high access speed, O(1). To achieve this, however, they over-allocate, so in most cases a dict is about one-third empty.
from sys import getsizeof as gs

d_obj = {}
print(gs(d_obj))  # print the default size
# 280

d_obj = {k: v for k, v in enumerate(range(5))}
print(gs(d_obj))  # the default size continues up to five items
# 280

d_obj = {k: v for k, v in enumerate(range(6))}
# a new memory allocation begins, where most of the memory is unutilized
print(gs(d_obj))
# 1048
Details of other data types can be found below: [Source]
Bytes   Type         Empty size + scaling notes
24      int          NA
28      long         NA
37      str          + 1 byte per additional character
52      unicode      + 4 bytes per additional character
56      tuple        + 8 bytes per additional item
72      list         + 32 for the first item, 8 for each additional
232     set          sixth item increases to 744; 22nd, 2280; 86th, 8424
280     dict         sixth item increases to 1048; 22nd, 3352; 86th, 12568
64      class inst   has a __dict__ attr, same scaling as dict above
16      __slots__    class with __slots__ has no dict; seems to store in a mutable tuple-like structure
120     func def     doesn't include default args and other attrs
904     class def    has a proxy __dict__ structure for class attrs
104     old class    makes sense, less stuff, has a real dict though
- __slots__ can be used to get rid of this unused memory. Note that once you define __slots__, you cannot add new attributes to instances. In this way, you can reduce the memory usage of your objects.
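A minimal sketch comparing a plain class with a slotted one:

from sys import getsizeof

class PlainPoint:
    def __init__(self, x, y):
        self.x = x
        self.y = y

class SlottedPoint:
    __slots__ = ("x", "y")  # fixed attribute set; no per-instance __dict__

    def __init__(self, x, y):
        self.x = x
        self.y = y

p = PlainPoint(1, 2)
s = SlottedPoint(1, 2)
print(getsizeof(p.__dict__))  # the per-instance dict that __slots__ removes
print(getsizeof(s))           # the slotted instance itself stays small
# s.z = 3 would raise AttributeError: new attributes cannot be added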
Using Pythonic code
There are many cases where writing Pythonic code improves performance. Some examples are listed below, followed by a short sketch:
Using list/generator expressions over explicit for loops [Reference][Reference]
Using short-circuit operators to avoid additional if-else condition checks [Reference]
Using existing Python libraries rather than reinventing the wheel
Using local function variables instead of global variables [Reference][Reference]
Using decorators to measure time taken at the function level, and disabling them in production runs
Converting numerically intensive operations to NumPy
Using enumerate instead of maintaining a separate counter for indexes
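A minimal sketch covering a few of these idioms (the values are arbitrary):

values = [3, 1, 4, 1, 5, 9]

# List comprehension instead of an explicit loop with append
squares = [v * v for v in values]

# Generator expression: no intermediate list is materialized
total = sum(v * v for v in values)

# Short-circuit `or` instead of an explicit if-else default check
user_name = None
display_name = user_name or "anonymous"

# enumerate instead of maintaining a separate index counter
for i, v in enumerate(values):
    print(i, v)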
Pre-loading memory/time intensive operations
There are memory- and time-intensive operations that can be performed while importing the libraries, before the server starts. This way, the query processing time decreases.
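A minimal sketch, assuming a Flask app serving a pickled scikit-learn-style model; model.pkl and its predict method are hypothetical:

import pickle
from flask import Flask, jsonify

app = Flask(__name__)

# Loaded once at import time, before the server starts handling requests,
# instead of inside the request handler where it would run on every query.
with open("model.pkl", "rb") as f:
    MODEL = pickle.load(f)

@app.route("/predict")
def predict():
    return jsonify(result=MODEL.predict([[1.0, 2.0]]).tolist())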
Modularization
Modularization is always good. It is necessary for maintainable code that can be edited quickly for ever-changing business requirements. If a function is called more than 20 times, then each second saved in it is multiplied that many times.
Disabling print statements
A separate function is used for printing during operations, and it can be disabled for production runs with a single flag change. The program can then run as a service, with log files used for operational monitoring.
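A minimal sketch of such a flag-controlled print function; VERBOSE, log_print, and service.log are hypothetical names:

import logging

VERBOSE = False  # single flag: True during development, False in production
logging.basicConfig(filename="service.log", level=logging.INFO)

def log_print(*args):
    """Always write to the log file; print to stdout only when VERBOSE is set."""
    message = " ".join(str(a) for a in args)
    logging.info(message)
    if VERBOSE:
        print(message)

log_print("request processed in", 0.42, "seconds")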
Technology stacks for performance
Nginx for load balancing across multiple servers
Redis as the cache manager
RabbitMQ or Apache Kafka for job queue management
CherryPy as the WSGI server on Windows
uWSGI on Linux
Flask for REST API development
Conclusion
After implementing the above methods, concurrent user support increased more than 10 times. By implementing all the methods described here, we aim to increase concurrency support by at least 100 times.
Thank you for your time.