Computer Programming Languages: Why C Runs So Much Faster Than Python

This post was published on the now-closed HuffPost Contributor platform. Contributors control their own work and posted freely to our site. If you need to flag this entry as abusive, send us an email.
Getty Images

Why is python slower than C? originally appeared on Quora - the place to gain and share knowledge, empowering people to learn from others and better understand the world.

Answer by Terry Lambert, Apple Core OS Kernel Team; technical lead on several projects over 8 years, on Quora:

Python is slower than C because it is an interpreted language.

This amplifies the number of actual CPU instructions required in order to perform a given statement.

In a Python program, you may add the value 1 to a variable, or compare the value of a variable to see whether it is less than a certain value, greater than a certain value, or precisely equal to a certain value, such as in:

In assembly language, you can similarly do this loop:

The difference is that the python code will be interpreted, instead of directly by the CPU.

This makes all the difference in the world, with regard to performance.

Python code almost always runs in a virtual machine.

(I say “almost” because it doesn’t have to, but except under really limited circumstances — no, actually, it really has to)

Another name for a virtual machine is a “bytecode interpreter”.

Interpreted code is always slower than direct machine code, because it takes a lot more instructions in order to implement an interpreted instruction than to implement an actual machine instruction.

Example time! Yay!

For example, let’s take the x += 1. In an Intel CPU, an increment of a register is a single op, has a latency of 1, and a reciprocal throughput value of 1/3.

In other words: it’s about the fastest CPU instruction it’s possible to have on an Intel processor.

How is this x += 1 accomplished in Python?

In order to know this, you have to know how Python works internally.

Internally, Python is composed of a tokenizer, a lexical analyzer, a bytecode generator, and a bytecode interpreter:

  • Tokenizer This converts input Python code (ASCII text files) into a token stream
  • Lexical Analyzer This is the part of Python that cares all about those meaningful spaces and indentation. This is where syntax checking happens.
  • Bytecode Generator This is the Part of Python that does the optimizations, if any; because Python is not actually a compile language, the range of optimizations is limited, compared to the optimizations which you might get out of a C compiler.
  • Bytecode Interpreter This is the part of Python that operates on the bytecode stream, and maintains the state of the Python virtual machine.

Bytecode, once generated is typically cached in memory.

This provides a speed improvement, because it means you can avoid repeating the tokenization, lexical analysis, and bytecode generation steps for code that Python has already seen.

So when we iterate our while loop, we can skip the tokenization an lexical analysis and bytecode generation steps, and just hand the bytecode off to the bytecode interpreter, again and again.

This is fast, right?

Actually, no.

While it is faster to use cached bytecode, this is not the same thing as operating as quickly as machine code.

A virtual machine is not the actual CPU on which the code is running.

A short intro to virtual machines.

One of the earliest used virtual machines was called the UCSD p-System, from the University of California, San Diego. It was around in 1972.

Shortly afterward, Microsoft released their version of BASIC (based on Dartmouth University BASIC from 1964), which tokenized the BASIC code in much the same way that Python tokenizes today. BASIC is stored in memory the same way Python bytecode is stored in memory following tokenization, although BASIC delays some of the lexical analysis stage until runtime. In its virtual machine: the BASIC interpreter.

Other than not having line numbers for each line, and using spacing as a statement block management technique, instead of not having one and using GOTO instead, Python largely resembles the BASIC interpreters we had 40 years ago.

“Compiling” vs. compiling.

Compiled UCSD Pascal was not compiled to assembly language, like in other compiled languages at the time.

Instead, it was compiled into p-Code.

Apple, where I worked, had the only non-revokable license to the UCSD p-Code system. This was a licensing mistake which UCSD did not later repeat.

Most of ProDOS on the Apple II was written in Pascal, and almost all code to do QuickDraw in the early Macintoshes was written in Pascal.

So when you think of a “compiled Pascal program”, at the time, you were talking about p-Code. Or “bytecode”, if you are a Java or Python fan, and want to pretend you invented something new.

Python also has the concept of “compiled Python”; this is Python code that has gone through the tokenizer, the lexical analyzer, and the bytecode generator, to get what you’d have in memory as cached bytecode, which was ready to feed to the bytecode interpreter (AKA the Python Virtual Machine).

Whenever you see a file that ends in .py, this is an ASCII text file that contains Python source code.

When you see a file that ends in .pyc, this is “PYthon, Compiled”.

The resulting code still runs in a virtual machine.

Native code.

A program isn’t really native code until it’s been compiled, and a program hasn’t actually been compiled to native code until it’s compiled to the native binary CPU instructions on the platform it targets.

This normally involves generating assembly code instead of bytecode, passing the assembly code to an assembler, and the assembler spitting out platform specific object files.

After that, the program is still not ready to run, until it’s linked to a platform runtime. A runtime sets up the environment in which the code expects to run, and can provide a number of runtime services such as dynamic object loading. Compiled C has a runtime. Compiled C++ has a runtime.

Mostly, these runtimes just set up the environment, and jump to your code. In other words: they have a one-time cost.

The process of associating a generated object file with a runtime (and any support libraries) to generate a standalone executable is called “linking”.

Virtual machines do not do linking; the closest they get is loading additional modules into the virtual machine. While these modules can themselves be compiled — in fact, this is the basis of some python modules to speed operations in Python up, and it’s the basis of the JNI (Java Native Interface) — the modules are generally written in a compiled language, such a C or C++, or if they are something like a math library, they might even be written in assembly.

Something isn’t compiled, then, unless it becomes a native binary for the platform it’s running on.

Why running in a virtual machine makes you slow.

Running in a virtual machine requires interpreting a bytestream, and then running native code on behalf of the bytestream.

So let’s go back to our example:

We’re just adding one, right?

What actually happens under the covers to implement this in Python, even after it has been converted to bytecode, is (according to my profiler) 378 machine instructions executed by the Python virtual machine.

Doing the same thing in C:

costs 1 machine instruction, because C compiles that down to assembly language, and that assembly language looks like this:

So you end up doing a ton of work to accomplish something, when just a little work would do the job.

Why Python itself is intrinsically slow on top of that.

CPython is the C implementation of Python, which is the default Python implementation utilized practically everywhere.

CPython interpreters can not take substantial advantage of multithreading.

Why is this?

CPython has what’s called a global interpreter lock. This is utilized by the interpreter to ensure that there’s only one thread executing Python bytecode at a time. If a thread calls into an external module, or it blocks, then another Python thread is able to run.

You may have 5,000 cores in your machine, but you’re only going to be running Python on one of them at a time.

External modules written in something other than Python can provide substantial performance improvements, because they can use this multithreading — this is called Apartment Model Threading, and it was “invented” by Microsoft in 1996 or so as part of their ViPER framework, which was their way of adding multithreading support in NT 3.51.

Once the CPython thread goes into the “Apartment” (module), it drops the global interpreter lock, and can use multiple threads in the C/C++/other code all it wants — until its time for it to come back out of the apartment.

Like a human who enters their own apartment: it can take it’s clothes (the lock) off, and dance around naked, if it wants.

Making Python faster.

Java has a technology called JIT, or Just In Time compilation. What a JIT does is takes the bytecode, and converts it into native assembly language for the platform it’s running on.

Another name for this technology is dynamic compilation.

What a JIT does is do the missing pieces between the bytecode and the assembly code; it converts the bytecode into assembly code, and then caches that for re-execution.

A JIT can be nearly as fast a a natively compiled program, but has runtime overhead for the conversion, every time you run it. It’s an incremental overhead, that results in the code not being as fast as it could be.

To combat this, you probably want to compile Java to native code; there’s a commercial product, called Excelsior JET, that you can buy to do this. There’s also the old GCJ (GNU Compiler for Java), which has been removed. You can find it in archives, but the version will not support most recent versions of Java. It was discontinued after the Oracle license change, on GNU philosophical grounds.

But back to Python!

Are there JITs for Python?

Absolutely! There’s PyPy, there’s Numba, and Microsoft has Pyjion (the link is to GitHub sources). There are a few others, which tend to be less well known and therefore less utilized or maintained.

The Numba project can even statically compile Python — or a subset of RPython, which is itself a subset of Python — to native code.

Unfortunately, when you use Python, you’re probably not using any of these, and you’re probably using the standard Python implementation, CPython instead.

So the bottom line?

Python is slower because it’s an interpreted language.

This question originally appeared on Quora - the place to gain and share knowledge, empowering people to learn from others and better understand the world. You can follow Quora on Twitter, Facebook, and Google+. More questions: