Python 2.7 to 3.x Migration Guide
Currently, all major projects in the Python scientific stack support both Python 3.x and Python 2.7. However, this situation is coming to an end. Last November, a statement from the NumPy team caught the attention of the data science community: the scientific computing library is about to drop support for Python 2.7 and move to Python 3 only. NumPy is not the only tool that has announced it will abandon older Python versions; many projects, such as pandas and Jupyter Notebook, are on the list as well. For data science developers, switching existing projects from Python 2 to Python 3 has therefore become a pressing problem. Dr. Alex Rogozhnikov from the University of Moscow compiled a code migration guide for us.
Introduction to Python 3
Python is a mainstream language in machine learning and other scientific fields. We usually need to use it to process a large amount of data. Python is compatible with multiple deep learning frameworks and has many excellent tools for data preprocessing and visualization.
However, Python 2 and Python 3 have coexisted in the Python ecosystem for a long time, and many data scientists still use Python 2. By the end of 2019, NumPy and many other scientific computing tools will stop supporting Python 2, and after 2018 all new NumPy feature releases will support only Python 3.
To make the transition from Python 2 to Python 3 easier, I have collected some Python 3 features that I hope you will find useful.
Use pathlib to better process paths
pathlib is part of the Python 3 standard library and helps you avoid piling up os.path.join calls:
from pathlib import Path

dataset = 'wiki_images'
datasets_root = Path('/path/to/datasets/')

train_path = datasets_root / dataset / 'train'
test_path = datasets_root / dataset / 'test'

for image_path in train_path.iterdir():
    with image_path.open() as f:  # note, open is a method of Path object
        # do something with an image
        pass
In Python 2 it was always tempting to build paths by string concatenation (concise, but fragile). With pathlib, the code is safe, concise, and readable.
In addition, pathlib.Path has many methods and properties, so newcomers to Python do not have to search for each of them:
p.exists()
p.is_dir()
p.parts  # a property, not a method
p.with_name('sibling.png')  # only change the name, but keep the folder
p.with_suffix('.jpg')  # only change the extension, but keep the folder and the name
p.chmod(mode)
p.rmdir()
Pathlib saves a lot of time. For details, see:
Documentation: https://docs.python.org/3/library/pathlib.html;
For more information, see https://pymotw.com/3/pathlib/.
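The Path methods listed above can be tried out without touching real data. The following is a minimal sketch that builds a throwaway folder with tempfile; the folder layout and the file names (cat.png, dog.png) are made up for illustration.

```python
import tempfile
from pathlib import Path

# Build a throwaway dataset layout so the example is self-contained.
root = Path(tempfile.mkdtemp())
train_path = root / 'wiki_images' / 'train'
train_path.mkdir(parents=True)
(train_path / 'cat.png').write_bytes(b'not really an image')

image_path = next(train_path.iterdir())
renamed = image_path.with_suffix('.jpg')   # same folder and stem, new extension
sibling = image_path.with_name('dog.png')  # same folder, new file name

print(image_path.name, renamed.name, sibling.name)
print(image_path.parent == sibling.parent)
```

Note that with_suffix and with_name only compute new paths; nothing is renamed on disk.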
Type hinting is part of the language.
Python is no longer just a scripting language. Today's data pipelines involve many steps, and each step may use a different framework (and sometimes different logic).
Type hints were introduced to Python to help with the growing complexity of projects, so that machines can verify more of the code. Before that, different modules used their own conventions for describing types in docstrings (note: PyCharm can convert old docstrings to the new type hints).
The following code is a simple example that works with different types of data (this is what we like about the Python data stack).
import numpy

def repeat_each_entry(data):
    """Each entry in the data is doubled.

    <blah blah nobody reads the documentation till the end>
    """
    index = numpy.repeat(numpy.arange(len(data)), 2)
    return data[index]
The code above works for numpy.array (including multi-dimensional arrays), astropy.Table, astropy.Column, bcolz, cupy, mxnet.ndarray, and so on.
This code also runs on a pandas.Series, but it gives the wrong result:
repeat_each_entry(pandas.Series(data=[0, 1, 2], index=[3, 4, 5])) # returns Series with Nones inside
This was only two lines of code. Imagine how unpredictable the behavior of a complex system can be when a single function misbehaves. In large systems it helps to state which types a function expects, so that you are warned when the function receives something else.
def repeat_each_entry(data: Union[numpy.ndarray, bcolz.carray]):
If you have a large codebase, type-checking tools such as MyPy are likely to become part of your continuous integration pipeline. Unfortunately, type hints are not yet powerful enough to express fine-grained types for ndarrays/tensors, but we may have that soon, and it will be a great feature for data science.
Type hinting → type checking at runtime
By default, function annotations do not affect how your code runs; they only document the intent of the code.
However, you can use tools such as enforce to check types at runtime, which can help you debug (there are many cases where type hints alone are not enough).
import enforce
from typing import List

@enforce.runtime_validation
def foo(text: str) -> None:
    print(text)

foo('Hi')  # ok
foo(5)     # fails

@enforce.runtime_validation
def any2(x: List[bool]) -> bool:
    return any(x)

any ([False, False, True, False])  # True
any2([False, False, True, False])  # True

any (['False'])  # True
any2(['False'])  # fails

any ([False, None, "", 0])  # False
any2([False, None, "", 0])  # fails
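To show the idea behind runtime validation without any third-party package, here is a minimal sketch using only the standard library. The check_types decorator is a hypothetical helper written for this illustration, not the API of enforce, and it only handles annotations that are plain classes.

```python
import functools
import inspect
from typing import get_type_hints

def check_types(func):
    """Hypothetical helper: validate plain-class annotations at call time."""
    hints = get_type_hints(func)
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            # Only plain classes are checked; generics such as List[bool] are skipped.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f'{name} must be {expected.__name__}, got {type(value).__name__}'
                )
        return func(*args, **kwargs)
    return wrapper

@check_types
def foo(text: str) -> None:
    print(text)

foo('Hi')  # ok
try:
    foo(5)
except TypeError as error:
    print('rejected:', error)
```

A real checker also has to understand generic types and return values, which is exactly the work a dedicated library takes off your hands.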
Other uses of function annotations
As mentioned above, annotations do not affect code execution; they simply provide metadata that you can use however you like.
For example, measurement units are a common pain in the scientific community. The astropy package provides a simple decorator that checks the units of the inputs and converts the output to the required units.
# Python 3
from astropy import units as u

@u.quantity_input()
def frequency(speed: u.meter / u.s, wavelength: u.m) -> u.terahertz:
    return speed / wavelength

frequency(speed=300_000 * u.km / u.s, wavelength=555 * u.nm)
# output: 540.5405405405404 THz, frequency of green visible light
If you process tabular scientific data in Python (not necessarily a lot of it), give astropy a try. You can also define your own application-specific decorators to control or convert inputs and outputs in the same way.
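An application-specific decorator of this kind might look as follows. This is only a sketch under a simple assumption: every annotation is treated as a callable that converts the incoming value. The names convert_inputs, km_to_m, and travel_time are hypothetical.

```python
import functools
import inspect

def convert_inputs(func):
    """Hypothetical decorator: treat each parameter annotation as a converter."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            converter = signature.parameters[name].annotation
            if converter is not inspect.Parameter.empty:
                bound.arguments[name] = converter(value)
        return func(*bound.args, **bound.kwargs)
    return wrapper

def km_to_m(value):
    return value * 1000.0

@convert_inputs
def travel_time(distance: km_to_m, speed: float):
    """distance is given in km, speed in m/s; the result is in seconds."""
    return distance / speed

print(travel_time(3, speed='15'))  # 3 km at 15 m/s
```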
Matrix multiplication with @
Below, we implement one of the simplest machine learning models, linear regression with L2 regularization:
# l2-regularized linear regression: || A X - b ||^2 + alpha * || X ||^2 -> min

# Python 2
X = np.linalg.inv(np.dot(A.T, A) + alpha * np.eye(A.shape[1])).dot(A.T.dot(b))
# Python 3
X = np.linalg.inv(A.T @ A + alpha * np.eye(A.shape[1])) @ (A.T @ b)
The Python 3 version with @ as matrix multiplication is more readable and translates more easily between deep learning frameworks: for example, the same code X @ W + b[None, :] describes a single-layer perceptron in numpy, cupy, pytorch, tensorflow, and other libraries.
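The @ operator is not tied to any particular library: a class supports it by defining __matmul__. The toy Matrix class below is a made-up illustration in pure Python, not how numpy implements it.

```python
class Matrix:
    """Tiny list-of-rows matrix, only to illustrate the @ protocol."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]

    def __matmul__(self, other):
        # Row-by-column products; zip(*rows) iterates over the columns.
        columns = list(zip(*other.rows))
        return Matrix(
            [sum(a * b for a, b in zip(row, column)) for column in columns]
            for row in self.rows
        )

A = Matrix([[1, 2], [3, 4]])
B = Matrix([[0, 1], [1, 0]])
print((A @ B).rows)
```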
Globbing with **
Recursive folder globbing is awkward in Python 2, which is why the custom glob2 module was written to work around it. The recursive flag is supported since Python 3.5.
import glob

# Python 2
found_images = \
    glob.glob('/path/*.jpg') \
    + glob.glob('/path/*/*.jpg') \
    + glob.glob('/path/*/*/*.jpg') \
    + glob.glob('/path/*/*/*/*.jpg') \
    + glob.glob('/path/*/*/*/*/*.jpg')

# Python 3
found_images = glob.glob('/path/**/*.jpg', recursive=True)
In Python 3, an even better option is pathlib:
# Python 3
import pathlib
found_images = pathlib.Path('/path/').glob('**/*.jpg')
print is a function in Python 3
Yes, print in Python 3 needs those annoying parentheses, but it also brings some advantages.
Simple syntax for writing to a file descriptor:
print >>sys.stderr, "critical error"      # Python 2
print("critical error", file=sys.stderr)  # Python 3
Printing tab-aligned tables without str.join:
# Python 3
print(*array, sep='\t')
print(batch, epoch, loss, accuracy, time, sep='\t')
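The sep and file arguments can be combined. The sketch below writes a small tab-separated table into an in-memory buffer; io.StringIO stands in for a real log file, and the column names are invented for the example.

```python
import io

buffer = io.StringIO()
header = ['batch', 'epoch', 'loss']
row = [120, 12, 0.25]

# Each call writes one tab-separated line into the buffer instead of stdout.
print(*header, sep='\t', file=buffer)
print(*row, sep='\t', file=buffer)

print(buffer.getvalue(), end='')
```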
Modify and redefine the output of the print function:
# Python 3
_print = print  # store the original print function
def print(*args, **kargs):
    pass  # do something useful, e.g. store output to some file
In Jupyter it is useful to log each output to a separate file so that you can trace what happened when something goes wrong, and overriding print makes that possible.
In the following code, a context manager temporarily overrides the print function:
import contextlib

@contextlib.contextmanager
def replace_print():
    import builtins
    _print = print  # saving old print function
    # or use some other function here
    builtins.print = lambda *args, **kwargs: _print('new printing', *args, **kwargs)
    try:
        yield
    finally:
        builtins.print = _print  # restore the original even if the block raises

with replace_print():
    <code here will invoke other print function>
This is not a recommended approach: it is a small, dirty hack, and replacing a builtin affects every piece of code that runs inside the block.
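A less intrusive way to capture printed output, offered here as an alternative rather than as the author's method, is contextlib.redirect_stdout from the standard library. It leaves print itself untouched and only swaps sys.stdout for the duration of the block.

```python
import contextlib
import io

log = io.StringIO()
with contextlib.redirect_stdout(log):
    print('epoch 1 done')  # captured in `log`, nothing reaches the console

print('captured:', log.getvalue().strip())
```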
print can take part in list comprehensions and other language constructs:
# Python 3
result = process(x) if is_valid(x) else print('invalid item: ', x)
F-strings can be used for simple and reliable formatting.
The default formatting system provides a flexibility that is not needed in data experiments, and the resulting code is either too verbose or too fragile when modified. Data scientists typically print some logging information iteratively in a fixed format, and the code usually looks like this:
# Python 2
print('{batch:3} {epoch:3} / {total_epochs:3} accuracy: {acc_mean:0.4f}±{acc_std:0.4f} time: {avg_time:3.2f}'.format(
    batch=batch, epoch=epoch, total_epochs=total_epochs,
    acc_mean=numpy.mean(accuracies), acc_std=numpy.std(accuracies),
    avg_time=time / len(data_batch)
))

# Python 2 (too error-prone during fast modifications, please avoid):
print('{:3} {:3} / {:3} accuracy: {:0.4f}±{:0.4f} time: {:3.2f}'.format(
    batch, epoch, total_epochs, numpy.mean(accuracies), numpy.std(accuracies),
    time / len(data_batch)
))
Sample output:
120 12 / 300 accuracy: 0.8180±0.4649 time: 56.60
f-strings, also known as formatted string literals, were introduced in Python 3.6:
# Python 3.6+
print(f'{batch:3} {epoch:3} / {total_epochs:3} accuracy: {numpy.mean(accuracies):0.4f}±{numpy.std(accuracies):0.4f} time: {time / len(data_batch):3.2f}')
In addition, it is very convenient to write query statements:
query = f"INSERT INTO STATION VALUES (13, '{city}', '{state}', {latitude}, {longitude})"
An explicit difference between "true division" and "integer division"
This change is convenient for data science (though I believe it is not for system programming).
data = pandas.read_csv('timing.csv')
velocity = data['distance'] / data['time']
The result in Python 2 depends on whether "time" and "distance" (for example, in meters and seconds) are stored as integers.
In Python 3, the result is correct in both cases, because division always produces a floating-point number.
Another case is integer division, which is now an explicit operation:
n_gifts = money // gift_price # correct for int and float arguments
Note that this applies both to built-in types and to custom types provided by data packages (such as numpy or pandas).
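The two operators can be compared on plain built-in numbers. The values below are made up; the point is only that / always performs true division, while // rounds down explicitly.

```python
distance_m = 7  # integers, e.g. read from a CSV file without decimals
time_s = 2

velocity = distance_m / time_s      # true division: a float in Python 3
whole_steps = distance_m // time_s  # floor division: explicit rounding down

print(velocity, whole_steps)
print(7.5 // 2)  # floor division also works for floats
print(-7 // 2)   # rounds toward negative infinity, not toward zero
```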
Strict ordering
# All these comparisons are illegal in Python 3
3 < '3'
2 < None
(3, 4) < (3, None)
(4, 5) < [4, 5]

# False in both Python 2 and Python 3
(4, 5) == [4, 5]
Prevents unexpected sorting of different types of instances.
sorted([2, '1', 3]) # invalid for Python 3, in Python 2 returns [2, 3, '1']
Helps you discover problems when processing raw data.
NOTE: The proper check for None is (applicable to both versions of Python):
if a is not None:
    pass

if a:  # WRONG check for None
    pass
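Both points can be seen on a small list of raw values. In this sketch the list contents are invented; the cleaning step (drop None, convert to int) is one possible way to make the data sortable, not the only one.

```python
raw = [2, '1', 3, None]

try:
    sorted(raw)
    sort_failed = False
except TypeError as error:
    sort_failed = True
    print('cannot sort mixed types:', error)

# Clean explicitly: drop missing values with `is not None`, then convert.
cleaned = sorted(int(item) for item in raw if item is not None)
print(cleaned)

# A bare truthiness test would also drop legitimate zeros.
values = [0, None, 5]
truthy = [v for v in values if v]
not_none = [v for v in values if v is not None]
print(truthy, not_none)
```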
Unicode for Natural Language Processing
s = '您好'
print(len(s))
print(s[:2])
Output:
Python 2: 6\n��
Python 3: 2\n您好
x = u'со'
x += 'co'  # ok
x += 'со'  # fail
Python 2 fails here, while Python 3 works as expected (because I used Russian letters in the strings).
In Python 3, str values are Unicode strings, which makes NLP on non-English text much more convenient.
There are other funny things, for example:
'a' < type < u'a'  # Python 2: True
'a' < u'a'         # Python 2: False
from collections import Counter
Counter('Möbelstück')
Python 2: Counter({'\xc3': 2, 'b': 1, 'e': 1, 'c': 1, 'k': 1, 'M': 1, 'l': 1, 's': 1, 't': 1, '\xb6': 1, '\xbc': 1})
Python 3: Counter({'M': 1, 'ö': 1, 'b': 1, 'e': 1, 'l': 1, 's': 1, 't': 1, 'ü': 1, 'c': 1, 'k': 1})
These can also work correctly in Python 2, but Python 3 is more friendly.
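The Python 2 behavior can be reproduced in Python 3 by looking at the encoded bytes instead of the text. This sketch normalizes the word to NFC first, only to make the character count independent of how the source file was saved.

```python
import unicodedata
from collections import Counter

word = unicodedata.normalize('NFC', 'Möbelstück')
encoded = word.encode('utf-8')  # roughly what a Python 2 byte string held

print(len(word), len(encoded))
print(max(Counter(word).values()))     # every character occurs once
print(max(Counter(encoded).values()))  # the lead byte of ö and ü repeats
```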
Preserved order of dictionaries and **kwargs
In CPython 3.6+, dicts behave like OrderedDict by default (and this is guaranteed by the language from Python 3.7 on). This preserves order in dict comprehensions (and in other operations, such as JSON serialization/deserialization).
import json
x = {str(i): i for i in range(5)}
json.loads(json.dumps(x))
# Python 2
{u'1': 1, u'0': 0, u'3': 3, u'2': 2, u'4': 4}
# Python 3
{'0': 0, '1': 1, '2': 2, '3': 3, '4': 4}
The same applies to **kwargs (in Python 3.6+): they are kept in the order in which they appear in the call. Order is crucial when designing data pipelines. Previously we had to write it in this cumbersome way:
from collections import OrderedDict
from torch import nn

# Python 2
model = nn.Sequential(OrderedDict([
    ('conv1', nn.Conv2d(1,20,5)),
    ('relu1', nn.ReLU()),
    ('conv2', nn.Conv2d(20,64,5)),
    ('relu2', nn.ReLU())
]))

# Python 3.6+, how it *can* be done, not supported right now in pytorch
model = nn.Sequential(
    conv1=nn.Conv2d(1,20,5),
    relu1=nn.ReLU(),
    conv2=nn.Conv2d(20,64,5),
    relu2=nn.ReLU())
Did you notice? Uniqueness of names is also checked automatically.
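Both properties, preserved keyword order and the check for duplicate names, can be observed with a plain function. build_pipeline is a hypothetical helper; the duplicate-name case is compiled from a string because Python rejects it before the code can even run.

```python
def build_pipeline(**steps):
    """Hypothetical helper: return the step names in the order they were passed."""
    return list(steps)

order = build_pipeline(conv1='Conv2d', relu1='ReLU', conv2='Conv2d', relu2='ReLU')
print(order)

# A repeated keyword is rejected at compile time.
try:
    compile("build_pipeline(conv1=1, conv1=2)", '<example>', 'eval')
    duplicate_rejected = False
except SyntaxError as error:
    duplicate_rejected = True
    print('duplicate name rejected:', error.msg)
```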
Iterable unpacking
# handy when amount of additional stored info may vary between experiments, but the same code can be used in all cases
model_parameters, optimizer_parameters, *other_params = load(checkpoint_name)

# picking two last values from a sequence
*prev, next_to_last, last = values_history

# This also works with any iterables, so if you have a function that yields e.g. qualities,
# below is a simple way to take only last two values from a list
*prev, next_to_last, last = iter_train(args)
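A runnable version of the same idea, with a made-up generator standing in for the training loop (the original does not show iter_train, so this one is hypothetical):

```python
def iter_train():
    """Hypothetical training loop that yields one quality value per step."""
    for quality in [0.61, 0.70, 0.74, 0.79]:
        yield quality

# The starred name collects everything that the other names do not take.
*earlier, next_to_last, last = iter_train()
print(earlier, next_to_last, last)

first, *rest = 'abc'
print(first, rest)
```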
The default pickle engine provides better compression for arrays.
# Python 2
import cPickle as pickle
import numpy
print len(pickle.dumps(numpy.random.normal(size=[1000, 1000])))
# result: 23691675

# Python 3
import pickle
import numpy
len(pickle.dumps(numpy.random.normal(size=[1000, 1000])))
# result: 8000162
That is three times less space, and it is faster too. In fact, similar compression (but not speed) can be achieved in Python 2 with the protocol=2 parameter, but users usually ignore this option (or do not know about it at all).
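The effect of the protocol can be checked without numpy. The sketch below pickles a plain list of floats, so the absolute sizes differ from the array example above; it only illustrates that the text protocol (0) is bulkier than the binary one (2).

```python
import pickle
import random

values = [random.random() for _ in range(10_000)]

text_size = len(pickle.dumps(values, protocol=0))    # ASCII protocol
binary_size = len(pickle.dumps(values, protocol=2))  # binary protocol
default_size = len(pickle.dumps(values))             # default of the running Python 3

print(text_size, binary_size, default_size)
```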
Safer comprehensions
labels = <initial_value>
predictions = [model.predict(data) for data, labels in dataset]
# labels are overwritten in Python 2
# labels are not affected by comprehension in Python 3
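The scoping rule is easy to verify. In this sketch the dataset and the "prediction" (upper-casing a string) are placeholders chosen only to keep the example self-contained.

```python
labels = 'initial value'
dataset = [('x1', 'cat'), ('x2', 'dog')]

# `labels` inside the comprehension is a separate, local variable.
predictions = [data.upper() for data, labels in dataset]

print(predictions)
print(labels)  # the outer variable is untouched in Python 3
```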
About super()
Python 2's super(...) was a common source of mistakes in code.
# Python 2
class MySubClass(MySuperClass):
    def __init__(self, name, **options):
        super(MySubClass, self).__init__(name='subclass', **options)

# Python 3
class MySubClass(MySuperClass):
    def __init__(self, name, **options):
        super().__init__(name='subclass', **options)
Better IDE suggestions with variable annotations
The most enjoyable thing about programming in Java, C#, and similar languages is that the IDE can make excellent suggestions, because the type of every identifier is known before the code runs.
This is difficult to implement in Python, but annotations can help you:
Write down your expectations in a clear form
Get good suggestions from IDE
This is a PyCharm example with variable annotations. It works even if the functions you use are not annotated (for example, because of backward compatibility).
Multiple unpacking
Sample code that merges two dictionaries in Python 3:
x = dict(a=1, b=2)
y = dict(b=3, d=4)
# Python 3.5+
z = {**x, **y}
# z = {'a': 1, 'b': 3, 'd': 4}, note that value for `b` is taken from the latter dict.
The same approach works for lists, tuples, and sets (a, b, and c are any iterables):
[*a, *b, *c]  # list, concatenating
(*a, *b, *c)  # tuple, concatenating
{*a, *b, *c}  # set, union
Functions also support multiple unpacking for *args and **kwargs:
# Python 3.5+
do_something(**{**default_settings, **custom_settings})

# Also possible, this code also checks there is no intersection between keys of dictionaries
do_something(**first_args, **second_args)
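The difference between the two call styles shows up when the dictionaries share a key. The settings below are invented, and do_something is a stand-in that simply returns what it received.

```python
def do_something(**settings):
    return settings

default_settings = {'lr': 0.1, 'epochs': 10}
custom_settings = {'epochs': 50}

# Merging first: the later dictionary wins.
merged = do_something(**{**default_settings, **custom_settings})
print(merged)

# Unpacking both directly: overlapping keys raise an error.
try:
    do_something(**default_settings, **custom_settings)
    overlap_rejected = False
except TypeError as error:
    overlap_rejected = True
    print('overlapping keys rejected:', error)
```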
APIs with keyword-only arguments
Let's consider this code snippet:
model = sklearn.svm.SVC(2, 'poly', 2, 4, 0.5)
Obviously, the author of this code is not yet familiar with Python coding style (and has probably just come to Python from C++ or Rust). Unfortunately, this is not just a matter of taste, because changing the order of the parameters in SVC (by adding or deleting some) will break the code. In particular, sklearn often reorders or renames the numerous parameters of its algorithms to keep the API consistent, and each such refactoring can break the code.
In Python 3, library authors can require explicitly named parameters by using *:
class SVC(BaseSVC):
    def __init__(self, *, C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, ... )
Now the user has to name the parameters: sklearn.svm.SVC(C=2, kernel='poly', degree=2, gamma=4, coef0=0.5).
This mechanism makes the API both reliable and flexible.
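A small runnable illustration of the bare * in a signature follows. make_classifier is a hypothetical stand-in, not scikit-learn's actual constructor.

```python
def make_classifier(*, C=1.0, kernel='rbf', degree=3):
    """Hypothetical stand-in for a constructor with keyword-only arguments."""
    return {'C': C, 'kernel': kernel, 'degree': degree}

config = make_classifier(C=2, kernel='poly')
print(config)

try:
    make_classifier(2, 'poly')
    positional_rejected = False
except TypeError as error:
    positional_rejected = True
    print('positional call rejected:', error)
```

Because callers can only pass names, the library author is free to reorder or insert parameters later without breaking existing calls.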
Minor: constants in the math module
# Python 3
math.inf  # 'largest' number
math.nan  # not a number

max_quality = -math.inf  # no more magic initial values!

for model in trained_models:
    max_quality = max(max_quality, compute_quality(model, data))
Minor: a single integer type
Python 2 provides two basic integer types: int (a 64-bit signed integer) and long for long arithmetic (which is quite confusing after C++).
Python 3 has a single type, int, which also handles long arithmetic.
The following method is used to check whether the value is an integer:
isinstance(x, numbers.Integral)  # Python 2, the canonical way
isinstance(x, (long, int))       # Python 2
isinstance(x, int)               # Python 3, easier to remember
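The checks above can be tried directly in Python 3. The value 2 ** 100 is chosen only because it does not fit into 64 bits.

```python
import numbers

big = 2 ** 100  # no overflow and no separate long type
print(type(big).__name__, big)

print(isinstance(big, int), isinstance(big, numbers.Integral))
print(isinstance(True, numbers.Integral))  # bool is a subclass of int
print(isinstance(2.0, numbers.Integral))   # floats are not integers, even whole ones
```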
Others
Enums are useful in theory, but string-typed inputs are already widely adopted in the Python data stack. Enums do not seem to interplay with numpy or with the categorical type from pandas.
Coroutines also sound very promising for data pipelining, but there is no large-scale adoption yet.
Python 3 has a stable ABI
Python 3 supports Unicode identifiers (so ω = Δφ / Δt is also OK), but you had better stick to good old ASCII names
Some libraries such as jupyterhub (jupyter in cloud), django, and the new version of ipython only support Python 3. Therefore, features that sound useless to you may be useful for libraries that you will probably want to use at some point.
Code migration issues unique to Data Science (and how to solve them)
Support for nested arguments was dropped:
map(lambda x, (y, z): x, z, dict.items())
However, it still works perfectly well in comprehensions:
{x:z for x, (y, z) in d.items()}
In general, comprehensions also translate better between Python 2 and Python 3.
map(), .keys(), .values(), .items(), and so on return iterators or views instead of lists. The main problems with iterators are that they cannot be sliced trivially and cannot be iterated twice. Converting the result to a list solves almost all problems.
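The pitfalls of iterators, and the list conversion that avoids them, can be seen in a few lines. The data here is arbitrary.

```python
import itertools

squares = map(lambda v: v * v, range(5))

first_pass = list(squares)
second_pass = list(squares)  # the iterator is already exhausted
print(first_pass, second_pass)

# Slicing an iterator needs itertools.islice (or a list).
middle = list(itertools.islice(map(str, range(10)), 2, 5))
print(middle)

# Converting to a list restores indexing, slicing, and repeated iteration.
keys = list({'a': 1, 'b': 2}.keys())
print(keys[0], keys[::-1])
```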
Teaching machine learning and data science with Python
Course authors should first spend time explaining what an iterator is, and why it cannot be sliced, concatenated, multiplied, or iterated twice like a string (and how to deal with that).
I believe most course authors would be happy to avoid these details, but now that is almost impossible.
Conclusion
Python 2 and Python 3 have coexisted for nearly 10 years. By now we have to say: it is time to switch to Python 3.
Research and production code should become shorter, more readable, and safer after moving to a Python 3-only codebase.
Currently, most libraries support both 2.x and 3.x versions. However, we should not wait until popular packages drop Python 2 support. Start enjoying the new language features now.