This (long) blog post describes our experience migrating a large Python/Cython code base to Go at Repustate. If you want the whole story, background and all, read on. If you only want to know what Python developers should know before diving into Go, follow this link:
Tips & tricks for migrating from Python to Go
Background
At Repustate, one of our biggest technical achievements has been implementing Arabic sentiment analysis. Arabic is a tough nut to crack because its morphology is quite complex. Tokenization (splitting a sentence into individual words) is also harder than in English, because an Arabic word can itself contain whitespace (for example, depending on the position of the letter aleph within a word). Without giving away any secrets, Repustate uses support vector machines (SVMs) to arrive at the most likely meaning of a sentence and then applies sentiment to it. In total we use 22 models (22 SVMs), and every word in a document is evaluated against each one. So a 500-word document means 500 × 22 = 11,000 comparisons against SVMs.
Python
Repustate is almost entirely a Python shop. We use Django for the API and the website, so at the time it made sense to keep the code base consistent and implement the Arabic sentiment engine in Python too. For prototyping and implementation, Python is still a good choice: it is very expressive and has great third-party libraries. If you are just serving web pages, Python is more than enough. But when you do low-level computation that leans heavily on hash tables (dictionaries in Python), everything slows down. We could process about two to three Arabic documents per second, which is far too slow. By comparison, our English sentiment engine handles about 500 documents per second.
Bottleneck
So we fired up the Python profiler and started investigating what was taking so long. Remember the 22 SVMs that every word has to pass through? All of that was being done serially, not in parallel. So our first idea was to turn the serial loop into a map/reduce-style operation. Long story short: Python is not good at map/reduce, and it struggles when you need concurrency. At PyCon, Guido talked about Tulip, his new project to address this shortcoming, but it will take a while to land, and why wait when something better is already available?
Go, or go home?
Friends at Mozilla told me that Mozilla is switching much of its logging infrastructure to Go, partly because of how powerful goroutines are. Go was designed by people at Google with concurrency as a first-class concept, rather than something bolted on afterwards as in Python. So we set about moving from Python to Go.
Although the Go code is not yet in production, the results are very encouraging. We can now process 1000 documents per second, using less memory, and without the debugging headaches you run into in Python: ugly multiprocessing / gevent / "why won't Control-C kill the process" code.
Why we like Go
Anyone with a little understanding of how programming languages work (interpreted vs. compiled, dynamic vs. static) will say, "Well, of course Go is faster." And yes, we could have rewritten everything in Java and seen similar improvements, but that is not why Go wins. The code you write in Go just seems to be correct. I can't quite put my finger on it, but once the code compiles (and it compiles quickly), you get the feeling that it will work: not merely run without errors, but actually be logically correct. I know that sounds vague, but it is true. Go is very similar to Python in terms of verbosity (or the lack of it), and it treats functions as first-class objects, so functional-style code is easy to reason about. And of course goroutines and channels make your life easier. You get the performance boost of static typing and finer control over memory allocation, without paying much for it in expressiveness.
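To make the goroutines-and-channels point concrete, here is a minimal sketch of the map/reduce-style fan-out described earlier: one goroutine per model, with results collected over a channel. The `model` type, its `score` method, and `scoreWord` are hypothetical stand-ins invented for this example; the post does not show Repustate's actual code, and a real SVM evaluation would replace the placeholder arithmetic.

```go
package main

import (
	"fmt"
	"sync"
)

// model is a hypothetical stand-in for one trained classifier.
type model struct {
	name   string
	weight float64
}

// score is a placeholder for the real per-word evaluation.
func (m model) score(word string) float64 {
	return m.weight * float64(len(word))
}

// scoreWord evaluates the word against every model concurrently (the
// "map" step) and sums the results read from a channel (the "reduce" step).
func scoreWord(models []model, word string) float64 {
	// Buffered so that no goroutine blocks while sending its result.
	results := make(chan float64, len(models))
	var wg sync.WaitGroup
	for _, m := range models {
		wg.Add(1)
		go func(m model) {
			defer wg.Done()
			results <- m.score(word)
		}(m)
	}
	wg.Wait()
	close(results)

	total := 0.0
	for r := range results {
		total += r
	}
	return total
}

func main() {
	// 22 models, matching the number mentioned in the post.
	models := make([]model, 22)
	for i := range models {
		models[i] = model{name: fmt.Sprintf("svm-%d", i), weight: 1}
	}
	fmt.Println(scoreWord(models, "hello"))
}
```

The buffered channel plus `sync.WaitGroup` keeps the sketch short; for long documents a fixed pool of worker goroutines may be a better fit than one goroutine per model per word.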
Things I wish I had known earlier (tips & tricks)
Compliments aside, working with Go code really does require a different mindset from working with Python. So here is the list of notes I made while migrating the code, just random thoughts that came to mind as I converted Python code to Go:
- No built-in set type (you have to use a map and test for existence)
- Because there are no sets, you have to write your own intersection, union, etc. functions
- No tuple type; you have to write your own structs or use slices (roughly, arrays)
- No method like __getattr__(), so you always have to check for existence rather than set defaults. In Python, for example, you can write value = dict.get("a_key", "default_value")
- You must always check errors (or explicitly ignore them)
- Unused variables and imports are not allowed, so testing something simple sometimes means commenting out code
- Converting between []byte and string. regexp works with []byte (which is mutable). That makes sense, but having to cast and re-cast variables is annoying all the same
- Python is more forgiving. You can slice a string with out-of-range indexes without an error, and you can slice with negative indexes. Not in Go
- You cannot mix types in a data structure. It may not be clean, but in Python I sometimes use a dictionary whose values are a mix of strings and lists. Not in Go: you have to clean up your data structures or define custom structs
- No unpacking of a tuple or list into separate variables (for example: x, y, z = [1, 2, 3])
- CamelCase naming is the convention (functions and structs whose names do not start with a capital letter are not exported to other packages). I prefer Python's lower_case_with_underscores style
- You must explicitly check whether err != nil, unlike Python, where many types can be used in bool-like checks (0, "", and None can all be interpreted as "not set")
- Documentation for some packages (e.g. crypto/md5) is sparse, but go-nuts on IRC is very useful and a great help
- Converting a number to a string (int64 -> string) is not the same as []byte -> string (where string([]byte) is enough). For numbers you need strconv
- Reading Go code feels more like reading a programming language, whereas Python can read almost like pseudocode. Go has more non-alphanumeric characters and uses || and && rather than "or" and "and"
- To write to a file there are both File.Write([]byte) and File.WriteString(string), which runs against the Python zen of having one obvious way to do something
- String interpolation is awkward; you end up reaching for fmt.Sprintf a lot
- There are no constructors, so the idiom is to write a NewType() function that returns the struct you want
- else (or else if) must be formatted properly: the else has to sit on the same line as the closing curly brace of the if block. Weird
- The assignment operators differ depending on whether you are inside or outside a function: := (short declaration) only works inside functions, while package-level variables need var with =
- If I want only the keys or only the values, as in dict.keys() or dict.values(), or a list of tuples as in dict.items(), there is no equivalent in Go. You have to iterate over the map yourself and build up your own list
- An idiom I sometimes use is a dictionary whose values are functions, which I call based on a given key. You can do this in Go, but all the functions must accept and return the same things, i.e. share the same signature
- If you use JSON and your JSON is a mix of types, good luck. You have to define a custom struct that matches the format of the JSON blob and then unmarshal the raw JSON into an instance of that struct. That is a lot more work than obj = json.loads(json_blob) in the Python world
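A few of the notes above are easier to see in code. The following is a minimal sketch, not code from the migration itself: the names `set`, `newSet`, `intersect`, `getOr`, `sortedKeys`, `doc`, and `parseDoc`, as well as the JSON shape, are all made up for illustration. It shows a set emulated with a map, a dict.get()-style lookup with a default, collecting a map's keys by hand, number-to-string conversion with strconv, and unmarshalling JSON into a custom struct.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strconv"
)

// set emulates Python's set with a map whose values take no space.
type set map[string]struct{}

// newSet follows the NewType() constructor idiom.
func newSet(items ...string) set {
	s := make(set, len(items))
	for _, it := range items {
		s[it] = struct{}{}
	}
	return s
}

// intersect returns the elements present in both sets.
func intersect(a, b set) set {
	out := set{}
	for k := range a {
		if _, ok := b[k]; ok {
			out[k] = struct{}{}
		}
	}
	return out
}

// getOr mimics dict.get(key, default) using the "comma ok" lookup.
func getOr(m map[string]string, key, def string) string {
	if v, ok := m[key]; ok {
		return v
	}
	return def
}

// sortedKeys mimics sorted(dict.keys()); Go map iteration order is not fixed.
func sortedKeys(m map[string]string) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// doc is a hypothetical JSON shape used only for this example.
type doc struct {
	Text   string   `json:"text"`
	Score  float64  `json:"score"`
	Topics []string `json:"topics"`
}

// parseDoc unmarshals a raw JSON blob into the custom struct.
func parseDoc(blob []byte) (doc, error) {
	var d doc
	err := json.Unmarshal(blob, &d)
	return d, err
}

func main() {
	common := intersect(newSet("a", "b", "c"), newSet("b", "c", "d"))
	fmt.Println(len(common))

	m := map[string]string{"a_key": "value", "other": "x"}
	fmt.Println(getOr(m, "missing", "default_value"))
	fmt.Println(sortedKeys(m))

	// Numbers need strconv; string(n) would not produce "42".
	var n int64 = 42
	fmt.Println("count=" + strconv.FormatInt(n, 10))

	d, err := parseDoc([]byte(`{"text":"hi","score":0.5,"topics":["go","python"]}`))
	if err != nil {
		fmt.Println("bad JSON:", err)
		return
	}
	fmt.Println(d.Text, d.Score, d.Topics)
}
```

Helpers like these are small, but they are exactly the kind of thing Python hands you for free, which is why they show up as friction when porting.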
Was it worth it?
Yes, a million times yes. The speed boost is simply too good to pass up. I also think Go is a trendy language right now, so when it comes to recruiting, having Go as an important part of Repustate's technology stack should help.