2011 February


 Just when I thought I could bear Ruby

February 24, 2011

After my previous diatribe, I had no option but to do some further research, just as a sanity check.

So, along comes Ruby.

(from wikipedia)
According to the Ruby FAQ, “If you like Perl, you will like Ruby and be right at home with its syntax. If you like Smalltalk, you will like Ruby and be right at home with its semantics. If you like Python, you may or may not be put off by the huge difference in design philosophy between Python and Ruby/Perl.”

Sounds good actually!

ashpool:~# apt-get install ruby
ashpool:~# cat >test.rb
print "hello world\n";
ashpool:~# chmod 755 ./test.rb
ashpool:~# ./test.rb
hello world

Hmm, that was actually pleasant! And I used a semicolon successfully!

So, then I do some further reading:

(from wikipedia)
Boolean evaluation of non-boolean data is strict: 0, “” and [] are all evaluated to true. In C, the expression 0 ? 1 : 0 evaluates to 0 (i.e. false). In Ruby, however, it yields 1, as all numbers evaluate to true; only nil and false evaluate to false.

What? You just broke binary! A fundamental computing principle. And also, by the way, the “principle of least astonishment”. Yes, I understand that it’s not supposed to be entirely true according to Matsumoto, but quite frankly I’m astonished.

To soften the blow:

A corollary to this rule is that Ruby methods by convention - for example, regular-expression searches - return numbers, strings, lists, or other non-false values on success, but nil on failure.

That’s fine, I don’t mind your convention, and it makes neat sense, since zero isn’t always a good indication of failure and a nil tells me something more than zero.
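The same convention shows up outside Ruby. As a rough sketch in Python rather than Ruby (`find_word` is a made-up helper for illustration, not a library function), a search that succeeds at position 0 returns a perfectly valid 0, while failure is signalled by None:

```python
import re

def find_word(pattern, text):
    """Return the index of the first match, or None when nothing matches.

    find_word is a hypothetical helper, not part of any library.
    """
    match = re.search(pattern, text)
    return match.start() if match else None

hit = find_word(r"hello", "hello world")   # match at index 0
miss = find_word(r"xyz", "hello world")    # no match at all

# Index 0 is a success, so test against None rather than truthiness.
print(hit is not None)
print(miss is None)
```

Here zero means "found at the start", which is exactly the case where using 0 as a failure flag would go wrong.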

But for fucks sake.

0 is false, 1 is true. And it doesn’t matter what the radix is… I see no reason to mess with that basic computing principle for the sake of a convention involving nil.

Once again — please, Mr. Interpreter, get out of my way, and let me get on with the job of programming. I’ll decide what is true and false, thank you very much. And 1 is true and 0 is false.

So, let’s try something simple. A loop!

ashpool:~# cat >./test.rb
for i in (1..2)
print "hello world\n";
end
ashpool:~# ./test.rb
hello world
hello world

That wasn’t so bad. The loop construct seems a bit simple (or maybe just nonclassical) but can be neatly used to iterate data structures such as one would expect from a dynamically typed language.

But srsly — “end” ?

I feel like I’m back in Pascal hell already. But at least the whitespace wasn’t significant. And “end” at least made things quite clear.

Being able to use a semicolon (without complaint) was also quite nice, and refreshing, for a change. But honestly, would it really have killed the language to allow {} as shortcuts ?

Aaah, wait, I see the problem here…

{} is also syntax for blocks, but only when passed to a method OUTSIDE the arguments parens.
When you invoke methods without parens, ruby looks at where you put the commas to figure out where the arguments end (where the parens would have been, had you typed them)


1.upto(2) { puts 'hello' } # it's a block
1.upto 2 { puts 'hello' } # syntax error, ruby can't figure out where the function args end
1.upto 2, { puts 'hello' } # the comma means "argument", so ruby sees it as a hash - this won't work because puts 'hello' isn't a valid hash

Holy shit.

This is why we have “()” and family in a decent language. Look! Here are the args — “()”. Here is the code block — “{}”. Here’s an array — “[]”. Here’s the end of the statement — “;” Zero ambiguity.

It appears that every time someone invents a new language to address the supposed inadequacies in other languages something goes awry.

I’m still willing to give Ruby a bit more of a go, since there seems to be more sensibility and flexibility than them Python dragons allow.

For example, my simple little string “acid test”.

ashpool:~# cat >test.rb
print test;
ashpool:~# ./test.rb

I’m going to have a tough time dealing with the stupidity of 0 and 1 though…

Perhaps I’m too much of a traditionalist, and an old fart, but quite frankly, I’d rather be coding Perl, PHP, or C.

6 responses to “Just when I thought I could bear Ruby”
  1. JP says:

    Eh, on python’s technical merits I’ll still fight the good fight, but at Ruby I draw the line. We can commence the ridicule of the language on the balcony later today 😉

  2. Simeon says:

    Programming languages is a favorite topic of mine.

    I have to say, other than your comments about Python namespaces in the previous post, the gripes with Python and Ruby you’ve listed so far appear to be largely irrelevant to the language’s suitability for solving problems. More important design aspects can be described. Consider the following attributes:

    * Generality and orthogonality
    * Typing, binding, scope
    * Supported control and data structures
    * Available mechanisms for abstraction
    * Implementations and tools
    * Standard libraries and documentation
    * Portability and platforms supported
    * Performance

    Unless your objective is merely to write about subjective syntactic preferences, I would be keen to see some commentary from you along these lines.

  3. Colin Alston says:

    Ahh the joys of Ruby. Have you discovered all the cases where it will silently accept conversion data without throwing any exception?

    irb(main):004:0> Time.parse("Your mom")
    => Mon Feb 28 12:56:25 +0200 2011
    irb(main):005:0> "Your mom".to_i
    => 0

    Fun times

  4. Colin Alston says:

    Oh yes and

    irb(main):006:0> "abc"[1]
    => 98


  5. Colin Alston says:

    Did you do this one as well?

    irb(main):001:0> x = 0
    irb(main):002:0> if x
    irb(main):003:1> puts "ok"
    irb(main):004:1> end

    Why is integer zero equivalent to true?

  6. roelf says:

    Haven’t tried all of those corner cases yet. But in keeping with Simeon’s suggestion, I’m building a more fully fledged evaluation. To be published Real Soon ™.


 The Zen of Python?

February 22, 2011

ashpool:~# python -m this
The Zen of Python, by Tim Peters

Tim Me
Beautiful is better than ugly. Yep, and python is ugly

if x < 0:
...      x = 0
...      print 'Negative changed to zero'
... elif x == 0:
...      print 'Zero'
... elif x == 1:
...      print 'Single'
... else:
...      print 'More'

“elif” ??? What the fuck is wrong with “else if”, or even “elseif” ?

Who is this “elif”? Something from Lord of the Rings mayhap ?

Explicit is better than implicit. Yeah, because everyone likes shit like “ext://sys.stdout” and sys.stdout instead of let’s say “stdout” ?

There is such a thing as too fucking explicit.

Simple is better than complex. Oh, you mean like PEP 3149. Yeah, let’s invent yet another shared object naming convention! and, let’s distinguish between varying interpreters such as “CPython, PyPy, Jython, etc” as well. That sounds terribly simple.
Complex is better than complicated. I missed the point of this one, especially, after investigating the prior.
Flat is better than nested. That’s just because you don’t have curly braces in python and you have to use whitespace to nest 🙄

For structures, nesting with proper delimiters such as {}, and () makes very good sense and is easily readable and “findable” with a decent editor. It appeals to the eye and makes structures easy to delineate and find.

Never mind the fact that shit like ctags loves it. Where’s pytags ?

In a language without these tokens, you’re fucked.

Of course, flat procedural code and “bailing early” is better, but this is true regardless of the language.

Sparse is better than dense. Yup. And the existence or nonexistence of curly braces does not affect this. In fact pep-0008 recommends all the general whitespace rules used by decent programmers, regardless of curlies.
Readability counts. All of the python views on readability viz function names or members are indeed useful, and common practice. Well done.

However let’s compare

while True:
    X = prelim_two()
    Y = another_thing(X)
    if X > Y:
        yet_another(X, Y)


while (1) {
    X = prelim_two();
    Y = another_thing(X);
    if ( X>Y )
        yet_another(X, Y);
}

A single extra line and judicious use of semicolons, brackets and braces has just separated all code neatly, and delineated everything to editor, and eye as to where they belong.
Sorry, but C syntax wins as far as readability is concerned.

Special cases aren’t special enough to break the rules. Cannot argue with this one. And again, should be standard practice for a decent language.
Although practicality beats purity. Oh wait, Indeed. Let’s consider the very “special” case of dealing with strings and embedded variables.

Which is why most statically typed languages have a printf() style formatting function, and dynamically typed languages offer plain $ dynamic substitution within strings, or printf() style functionality. Let’s let the programmer decide on his choice of poison…

So, Python would like you to address a string as an object using string methods and all of the syntactical “special cases” that have been added on top of strings.

And as a result, something basic, something that nearly every language can do, such as addressing a string as an array, Python does not allow!

Python doesn’t allow common string manipulation such as word[0] = ‘x’, as can be done in C, Perl or PHP.

Python does not allow the common mechanism of variable substitution in strings such as $var="$bar";
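For what it's worth, the standard library does offer a `$`-style substitution, just not as built-in string syntax. A minimal sketch using `string.Template` (the variable names here are made up for illustration):

```python
from string import Template

bar = "world"

# $bar is looked up among the keyword arguments passed to substitute().
greeting = Template("hello $bar").substitute(bar=bar)
print(greeting)
```

It is an explicit call rather than interpolation inside the literal, so whether it scratches the same itch is a matter of taste.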

So Python saves you on the curlies, and braces and fucks you on strings. The most primitive fucking programming construct known to man!

Give me more practicality please.

Errors should never pass silently. No. Errors should never cause an exception, due to for example an integer that could not be parsed.

Please don’t toss a fucking exception thank you very much, because I, the programmer have probably got more of an idea about what I care about than you do thank you very much Mr Interpreter.

You get along with running as much of the code as you possibly can and I’ll get along with the business of deciding what’s wrong and right. Don’t get in my fucking way. I’ll throw the exceptions when I want to, and of the kind I want to, and when I feel like catching them. System level stuff should never get in my way because otherwise we end up in Java-land.

And sprinkling the parsing of shit like integers and real numbers with try()’s and finally() isn’t useful thanks, I thought we were trying to do more with less. I’ll check if I like the integer or not.

Unless explicitly silenced. Yeah, I can sort of manage try() and catch() by myself here. And my view is “shut the fuck up until spoken to”.
In the face of ambiguity, refuse the temptation to guess. Agreed. A rule is a rule, stick with it.
There should be one– and preferably only one –obvious way to do it. No thanks. Give me options, if I’m a printf() fanatic let me use it. If I’m a string substitution junkie let me do it. In fact, allow me to do it as many ways as I like.

As a programmer I kind of like not being in a fucking sandbox. If I did, I’d be coding Java.

EXCEPT if it’s basic syntax and language constructs.

Although that way may not be obvious at first unless you’re Dutch. Guess that’s why we cannot have classic ternary operators such as “result = (a > b) ? x : y;”

So, because of Dutch we had to get “{True: x, False: y}[a > b]”

Yeah, that was the really fucking obvious way to do it 🙄 It’s nice to be different and all, but not to the point of fucking stupidity. Obvious stupidity.
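For the record, Python has had a conditional expression since 2.5, with the operands in a different order from C. A small sketch (the values are made up for illustration):

```python
a, b = 3, 2
x, y = "a is bigger", "b is bigger or equal"

# Conditional expression: <value if true> if <condition> else <value if false>
result = x if a > b else y

# The dict-lookup trick gives the same answer, but it builds the dict
# (and so evaluates both x and y) before picking one.
same = {True: x, False: y}[a > b]
print(result)
```

The difference matters once x or y is a function call with side effects, since the conditional expression only evaluates the branch it picks.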

Now is better than never. Yeah, if Python had curly braces and decent string manipulation I’d be using it now.
Although never is often better than *right* now. Yup, I’ll never use Python *right* now.
If the implementation is hard to explain, it’s a bad idea. Yeah, still trying to figure out how to dynamically change strings like they’re an array.

“Unlike a C string, Python strings cannot be changed. Assigning to an indexed position in the string results in an error:”

>>> word[0] = 'x'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?

However, creating a new string with the combined content is easy and efficient:

>>> 'x' + word[1:]
>>> 'Splat' + word[4]

Wait, what the fuck? Easy? Efficient? Sorry I missed that fucking explanation. Or maybe, it was just too hard to explain away that problem?

If the implementation is easy to explain, it may be a good idea. Sorry, but I’m still fucked on the previous one. Strings. Basic. Easy. Obvious?
Namespaces are one honking great idea — let’s do more of those!
>>>import honk.great.idea.and.however.the.fuck.deeply.you.want.to.nest.it.
>>>   print "hello"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
>>>> Exception "print" does not do what you think it does. Unless you're Dutch.

Namespaces are dangerous. Trust me, I once gave a TCL project to a bunch of girls who figured out what namespaces were. They thought it was all “neat” and “tidy”. I thought that I couldn’t read identifiers nested twelve levels deep. Namespaces pollute the basics of language.

Don’t do it. Create an object instance. Tada. Instant namespace.

Namespaces are useful, but for fucks sakes, don’t “honk it as a great idea”.

Python is like the bastard child of Java. I will never write serious code in it. Comments, corrections and diatribes welcome.

Footnote: If python had curlies, I’d probably be its biggest fan. Aside from the string fucked-ness.

24 responses to “The Zen of Python?”
  1. Simeon says:

    > Footnote: If python had curlies, I’d probably be its biggest fan.


    There, fixed.

  2. roelf says:

    Yep, I’ve actually tried pybraces before. It was a bit flakey though, but worthwhile.

    Of course, then I hit string manipulation, which wasn’t so bad, but a bit tiresome. I can live with the printf() world.

    Unfortunately, for me, pybraces is a crutch (or brace) for a basic design flaw, which is, sadly, the lack of braces.

    Without pybraces everywhere, I’d never be able to maintain some other guy’s code, and my eyes would continue to bleed trying to read their code.

    So, it’s really just easier to pick a dynamically typed language that does have braces, such as Perl, PHP, JavaScript, or Lua.

    With Lua, or JavaScript and variants I get all the crappy string handling, but at least I get braces 🙂

  3. jerith says:

    Before I begin, a quick note about my intentions. I’m not trying to convince you to use Python. I happen to like the language, but it isn’t suitable for everything and I actually don’t care one way or another if you use it or not. I’m posting this comment because I don’t think you’re being fair to the language and I don’t want people reading your post to go away with the impression that what you’ve described is the whole story.

    A lot of what you’re complaining about is really quite peripheral to Python, and you’re hitting it a lot because you’re trying to use C or Perl idioms in a language that isn’t suited to them. Python isn’t perfect, and it never will be, but cherry-picking the bad bits to make a point doesn’t prove very much. I’ll address a couple of specifics below.

    Indentation-based blocks:

    This is something that a lot of people don’t like, and it takes a bit of getting used to. Tabs and spaces aside, the indentation thing strongly encourages people to write code that can be parsed easily at a glance. Sure, it’s easy enough to indent correctly in brace-delimited languages, but I’ve spent a lot of time debugging code that doesn’t do what I think it does because the indentation is misleading. Once you have correct indentation in all cases, the braces just become noise. (Of course, you need a certain amount of editor support to easily manipulate Python code, but you need that to easily produce consistently readable code in any language.)


    You pick on string immutability quite a lot in your examples, so I’ll explain a bit about them. Immutable strings make a great number of things possible and efficient that would otherwise be problematic. For example, dicts require their keys to be immutable[1], and having strings as dict keys is rather useful. Unlike many other languages, strings are not just arrays of characters and therefore behave differently.

    Strings in Python do not have embedded variables, but substituting them in is very simple, and fairly similar to an extended printf() in C, although it uses an operator rather than a function call:

    name = "jerith"
    foo = "Hello, %s!" % name

    foo = "%(greeting)s, %(name)s" % {"greeting": "Hello", "name": "jerith"}

    If you don’t need to support Python versions prior to 2.6, you can also use the newer formatting stuff described in http://docs.python.org/library/string.html#formatstrings
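    As a minimal sketch of that newer style (assuming Python 2.6 or later; the variable names are only illustrative), the same two greetings with `str.format` look like this:

```python
name = "jerith"

# Positional and keyword fields with str.format (available from Python 2.6).
foo = "Hello, {0}!".format(name)
bar = "{greeting}, {name}".format(greeting="Hello", name=name)
print(foo)
print(bar)
```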

    Namespaces and explicitness:

    This is one of my favourite things about Python. If you’re using sys.stdin and sys.stdout a lot, you can say “from sys import stdin, stdout” at the top and just use stdin and stdout everywhere else. On the other hand, libraries (stdlib included) stay out of your way until you explicitly ask for them. Compare this to PHP’s shared global namespace where you have to prefix all your functions with a disambiguating prefix and it’s hard to write a drop-in replacement (or wrapper around) a standard library that doesn’t do quite what you want. You’re trading a little bit of import pain (which is really quite trivial) for the greater dependency management pain that is common in so many other languages.

    Errors passing silently:

    This is the one place I have to strongly disagree with you in general terms. I would *much* rather have my application crash when it doesn’t understand something than have it guess wrong. The single largest cause of software failure is programmer error, and I’m experienced enough to know that I write my share of bugs. In addition, customers (even if they’re other programmers) are really good at coming up with nonsensical data that we don’t even consider that we might need to handle.

    Story time: I recently had a financial reporting system crash on me because someone put ‘GBP22’ into a field that was supposed to be a numerical USD amount, but was stored as a string for other reasons. If I’d silently treated it as an integer (either 0 or 22) the thing would have gone unnoticed and we would simply have gotten the wrong numbers out of the other end. The nature of the exception led me directly to the cause of the error, a workaround was in place within an hour and the inconsistency in the data was known and accounted for in the final report even though a proper fix (which requires us to figure out what to do with non-USD amounts) is still pending. Because of this, we can have greater trust that there aren’t subtle bugs in the code that are turning our reports into garbage.
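    A rough sketch of that failure mode, not the actual reporting code: `parse_usd_amount` and the sample rows are hypothetical, and the amounts are assumed to be whole numbers. The point is only that `int()` raises `ValueError` on 'GBP22' instead of guessing 0 or 22:

```python
def parse_usd_amount(raw):
    """Parse a whole-number USD amount stored as a string.

    parse_usd_amount is a hypothetical helper; int() raises ValueError
    on input such as 'GBP22' instead of guessing 0 or 22.
    """
    return int(raw)

rows = ["100", "GBP22", "7"]   # made-up sample data
total = 0
rejected = []
for raw in rows:
    try:
        total += parse_usd_amount(raw)
    except ValueError:
        # Record the bad value instead of silently folding it into the total.
        rejected.append(raw)

print(total, rejected)
```

    Whether to crash, skip, or collect the bad rows is then an explicit decision in the code rather than something the interpreter decided quietly.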

    To conclude, I’ll return to your apparently insoluble problem of the lack of braces as block delimiters. It’s a change from what you’re used to, and it may be a dealbreaker, but I suspect that after working with Python for a few weeks you’ll get used to it and hardly notice it anymore. From where I stand, your objections sound a lot like someone refusing to learn German because it has umlauts and a more convoluted sentence structure than most other languages.

    [1] Well, hashable, which is slightly more complicated, but we can ignore that distinction for now.

  4. roelf says:

    Thanks Jerith, I wasn’t aware of the % operator functioning as a placeholder, although slightly clunky.

    The new Python>=2.6 “Format String Syntax” looks to be what I’m after, and seems natural enough to quell that specific string gripe.

    How do I address this ?


    • jerith says:

      The most direct equivalent is probably turning the string into a list (which is mutable) and back again:

      foo = "123"
      #=> "123"
      bar = [ch for ch in foo]
      #=> ["1", "2", "3"]
      bar[1] = "b"
      #=> ["1", "b", "3"]
      foo = "".join(bar)
      #=> "1b3"

      This is really ugly and incredibly verbose. (Aside: I used a list comprehension in the second line to turn the string into a list of strings. Listcomps are made out of happiness and kittens and were stolen from functional languages, like Haskell and Erlang.)

      You can also get a similar effect by slicing and joining the string, which is more Pythonic, but still ugly:

      foo = "123"
      #=> "123"
      foo = "".join([foo[:1], "b", foo[2:]])
      #=> "1b3"

      You could even package a slightly more general form of this into a little function if you need it a lot:

      def replace_from(src, index, new):
          return "".join([src[:index], new, src[(index+len(new)):]])

      replace_from("123", 1, "b")
      #=> "1b3"
      replace_from("12345678", 2, "cde")
      #=> "12cde678"

      (Because of the behaviour of string slicing, this works even if you run off the end of the string, although the results may be a bit odd if index > len(src).)

      Now that that’s out the way, do you have a real-world example of actually needing this kind of functionality? Depending on what you’re trying to do, there’s probably a better way.

      * str.replace() is good for replacing specific substrings.
      * str.translate() is more generally useful than it sounds.
      * Building output up from format strings is usually best if you’re trying to build a particular output format.
      * Specific tools exist for manipulating common types of string contents, such as paths and URLs.
      * There’s a regex library that is pretty good if that’s more your style.

      The way I see it, you’re asking the wrong kind of question. Rather than trying to implement a particular solution in Python, you should take a step back and find a solution that works well in Python. Learning a language’s idioms is as important as learning its syntax, and far more subtle. Instead of bashing your head against the things that Python doesn’t do well, look for a different approach that uses its strengths.

      This isn’t as easy as I make it sound, of course. It took me a year of writing Ruby professionally before I stopped trying to treat it as Python-with-a-different-syntax. Ruby still isn’t my favourite language, but at least I dislike it for the right reasons now.

      • roelf says:

        As indicated in further comments, bytearray is most likely going to do what I need to do. The application could be as simple as a state machine that parses a string and builds a reply based on the original string using complex states, but this is largely irrelevant. I’ve done some Netflow protocol translation, and string copies turned out to be a gazillion percent slower than simply modifying existing buffers.

        • jerith says:

          Ah, optimisation. Firstly, if you need really high performance low-level data fiddling, Python is probably the wrong language. On the other hand, that kind of performance is rarely necessary and comes at a very high cost in terms of code readability and maintainability.

          Secondly, your instincts about the performance of various operations almost certainly don’t carry over to Python. It’s a high level language and a lot of work has been put into improving performance, which sometimes has counterintuitive results. Most of this effort has been put into making “Pythonic” solutions fast, so using the language’s idioms is generally a good first step. You can experiment with more convoluted implementations if that turns out to be too slow, but measure first.

          Thirdly, and more generally, profile before you optimise. If your overall performance is I/O bound, it really doesn’t help to double the speed of your processing — you’ll need to wait just as long for the next piece of data to arrive anyway. Above all, remember Knuth’s admonition.

          So, to return to the specific case of in-place modification of strings, you’re approaching it from the wrong angle again. Python is a high-level language, so you’re giving up tight control of low-level details in exchange for expressing high-level concepts more clearly. Python’s string implementation is well-suited to a whole bunch of things that are harder to do with arrays of characters, but many of these things require string objects to be immutable. Of course, this same immutability gives you far better performance in some places where using mutable strings would require O(n) operations on the contents of the string.

          If you really need the performance in a small part of the code, you could use ctypes and a small external C library to do those bits. Or you could use a bytearray(). Just make sure you measure the various alternatives and don’t choose a “gut feel” solution that’s actually slower than the thing it’s supposed to be optimising.
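          One way to do that measuring is the standard `timeit` module. The following is only a sketch under assumptions: the two helper functions, the buffer size and the iteration count are all made up, and the numbers it prints will differ from machine to machine.

```python
import timeit

def with_slicing(src, index, new):
    # Builds a new immutable string from slices of the old one.
    return src[:index] + new + src[index + len(new):]

def with_bytearray(buf, index, new):
    # Overwrites bytes in place; buf is a bytearray, new is bytes.
    buf[index:index + len(new)] = new
    return buf

text = "12345678" * 100
data = bytearray(text.encode("ascii"))

slicing_time = timeit.timeit(lambda: with_slicing(text, 2, "cde"), number=100000)
inplace_time = timeit.timeit(lambda: with_bytearray(data, 2, b"cde"), number=100000)
print("slicing: %.4fs, bytearray: %.4fs" % (slicing_time, inplace_time))
```

          A micro-benchmark like this says nothing about the rest of the pipeline, so it is a starting point for profiling rather than a verdict.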

          • roelf says:

            Why not write optimal code from the start ? And I certainly disagree that it is rarely necessary. I knew “naturally” what the correct way was to deal with the problem — limit buffer copying.

            My Netflow scenario was not really I/O bound, but memcpy bound. And I did manage to do it in a very high level language (Perl) without having to resort to C. 100 lines or so of Perl achieved the same thing because the underlying tenets and mechanisms were the same as C. Maybe that’s my problem. I’m looking for an interpreted C language.

            But that’s not really the point, as indicated it can be done in Python too using bytearrays, but the entire thing has become convoluted. It just makes the entire thing so unnatural that it’s off-putting. My gut feel and decent understanding of C types and the way memory works allowed me to do things in Perl that I consider best practice.

            I guess this comes to Simeon’s views on “Generality & Orthogonality”. Maybe python is just the wrong tool. Perhaps I’m looking for a swiss army knife in Python, where I’m unlikely to find it.

            Or maybe I’m just a stubborn old fart.

  5. Colin Alston says:

    Is there a reason you want to do that? If you re-evaluate it in the context of other things you can do with the language, you might find Python has a better way to handle what you’re trying to do.

  6. roelf says:

    Let’s simplify the question. How do I replace the second character in the string “123” assigned to the variable “foo” with the letter “b” ? (edit: Jerith has given some examples, none of which are pretty).

    I can think of many reasons for wanting to do that, including some protocol implementations (example Netflow translation mentioned earlier).

    I suspect that a more generic buffer or array of characters or MutableString which is now *sigh* deprecated does what I need.

    In fact, some digging turns up bytearray() which is some sort of hybrid string. I’m already starting to feel the first symptoms of Java “type hell” here.
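    A minimal sketch of the "123" question with a bytearray (assuming ASCII content; a bytearray holds integer byte values rather than characters):

```python
foo = bytearray(b"123")

# Index assignment works on a bytearray; the value is an integer byte.
foo[1] = ord("b")

print(foo.decode("ascii"))
```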

    An interesting performance read due to immutability is http://skymind.com/~ocrow/python_string/ even though the example there is related to string appending.

    • jerith says:

      I still think you’re trying to solve the wrong problem. Instead of “How do I replace the second character of this string?”, ask “How do I build a handler for this protocol?”. I’ve built a number of protocol handlers in various languages, and the only time I’ve even wanted to modify bits of strings in-place has been for performance hacks in assembler or embedded C on microcontrollers with very limited resources.

      • roelf says:

        I’m pretty sure I’m not asking the wrong question, as my question is related to many possible scenarios. But since you keep insisting…

        Let’s create a scenario that I’ve coded in Perl before. It’s a Netflow packet “enrichment” tool.

        We need to read Netflow V5 packets on a UDP socket, perform an AS lookup on the source and destination IP address (srcaddr,dstaddr -or- bytes 0-7 in the flow record), and fill in the src_as, and dst_as numbers in bytes (40-43) of the flow record based on a GeoIP database lookup.

        In Perl, I simply read the packet (known to be limited to ~1500 bytes maximum) of the v5 record from the socket, AND WITHOUT COPYING THE BUFFER, do the lookup of srcaddr, and dstaddr, and populate the src_as and dst_as fields. This is done with a simple bit of unpack() and direct string/buffer modification.

        I then ship the modified original buffer using UDP to a new IP address, where some third party Netflow collector is running.

        I was able to prototype this code in an hour, and run it in production for 300-400Mbps worth of transit’s Netflow records without breaking a sweat. Perl’s string mutability saved the day. If I coded it in C, I’d probably have taken the same approach. I simply used read(), some string/buffer manipulation, and write()

        In Python:
        1. Stream read() will only return a string/bytes, which is IMMUTABLE (meaning buffer/memcpy()’s copies required before I can even start)
        2. So I have to use Stream.readinto() because that’s the only stream method with a signature capable of reading into a bytearray which IS MUTABLE (meaning no buffer copying required)
        3. Now I mess with the bytearray doing the src_asn and dst_asn magic.
        4. and then use Stream.write() to write the modified bytearray to a new destination.
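        The four steps above can be sketched roughly as follows. This is an illustration under assumptions, not the Perl tool: it targets Python 3, `lookup_asn` is a hypothetical stand-in for the GeoIP lookup, the packet is built in memory instead of read from a socket, and the offsets are the standard NetFlow v5 layout (24-byte header, 48-byte records, addresses in record bytes 0-7, AS numbers in bytes 40-43).

```python
import struct

HEADER_LEN = 24    # NetFlow v5 header size in bytes
RECORD_LEN = 48    # NetFlow v5 flow record size in bytes

def lookup_asn(ip_bytes):
    """Hypothetical stand-in for a GeoIP AS lookup; returns a 16-bit AS number."""
    return 64512 + ip_bytes[3] % 16

def enrich(packet, length):
    """Fill src_as/dst_as in place for every flow record in the packet."""
    count = struct.unpack_from("!H", packet, 2)[0]   # record count, header bytes 2-3
    for i in range(count):
        offset = HEADER_LEN + i * RECORD_LEN
        if offset + RECORD_LEN > length:
            break                                    # truncated packet, stop early
        srcaddr, dstaddr = struct.unpack_from("!4s4s", packet, offset)   # bytes 0-7
        struct.pack_into("!HH", packet, offset + 40,                     # bytes 40-43
                         lookup_asn(srcaddr), lookup_asn(dstaddr))

# In a real tool the buffer would be filled with sock.recv_into(packet) and
# forwarded with sock.sendto(memoryview(packet)[:length], collector_address),
# where collector_address is a placeholder for the collector's (host, port).
packet = bytearray(HEADER_LEN + RECORD_LEN)
struct.pack_into("!HH", packet, 0, 5, 1)                       # version 5, one record
packet[HEADER_LEN:HEADER_LEN + 8] = bytes([10, 0, 0, 1, 10, 0, 0, 2])
enrich(packet, len(packet))
print(struct.unpack_from("!HH", packet, HEADER_LEN + 40))
```

        `struct.unpack_from` and `struct.pack_into` read and write at an offset inside the existing bytearray, so the record is modified without building a new buffer.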

        My point is that the structure that I used in Perl, or C, was effectively a buffer that I could manipulate as a string, or array however I liked, without copying the buffer around and it was NATURAL in the language. Heck it would even be natural in PHP. I used stream read()/write() operations and some string manipulation (yes using indices)

        In python, I have to learn about the secret sauce of a bytearray() and then the special method readinto() (nicely orthogonal isn’t it?) or go in to string copying la-la land.

        I have lots of experience in this area to dispute your view that limiting buffer copies is for “performance hacks in assembler or embedded C on microcontrollers with very limited resources”. memcpy() is a performance killer. Plain and simple.

  7. roelf says:

    I’m definitely sticking with my view on “If the implementation is hard to explain, it’s a bad idea.”

  8. jerith says:

    So, here’s a timing comparison using strings and slicing vs in-place modification with bytearrays. The code I used is at http://paste.ubuntu.com/573541/

    lantea:rnd jerith$ ./strstuff.py

    Strings and slicing:

    make_record(): 14.753880024
    make_and_process_record(): 19.208892107
    make_and_process_record(True): 21.4235928059
    Processing time: 4.45501208305


    make_record(): 15.6495079994
    make_and_process_record(): 19.0997409821
    make_and_process_record(True): 21.3973038197
    Processing time: 3.45023298264

    Speedup percentage: 22.5539029228

    So you’re saving about a quarter of your processing time by using an in-place bytearray rather than building strings. It’s probably not negligible, but it isn’t a huge amount. Once you consider that this is probably only a small part of the processing cost, that number drops dramatically. If you include the time taken to convert a record into a bytearray for this test, the difference really is negligible. While you may save on that in a system that reads from the network, the geoip lookup is likely to take longer than the record formatting and the overall time saving is likely to be rather smaller than the 25% or so measured in this little test.

    So yes, there’s a performance difference. It’s even a nontrivial difference in my contrived test where building the record is the bulk of the time taken. It’s substantially less than an order of magnitude, though, and is unlikely to make a real difference in a production-quality system that actually does real work on real data. Personally, I’d go with the easier solution (using immutable strings) and rewrite in C or something if the performance was problematic. This isn’t the kind of problem Python solves particularly elegantly, but it’s certainly doable without trying to subvert the language.

    As to your comment about limiting buffer copies, you have a point, but only to an extent. If you know exactly what the machine is doing in all cases, you can avoid unnecessary buffer copies. I don’t know a huge amount about how Python handles strings internally, but it’s entirely feasible that slicing and concatenating doesn’t result in a memcpy() of the entire string contents — Erlang shares the internals of lots of immutable data structures to avoid copies, and it can do this because they’re immutable. There’s certainly a lot more to Python’s strings than the raw sequence of characters, and I wouldn’t presume to make performance assumptions without fully understanding what is involved. In any case, it’s usually easier to write the code and measure than analyse the internals of a rather complex language runtime.
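A minimal sketch of this kind of measurement, for readers who want to reproduce it. This is not the code from the paste linked above: it assumes Python 3, and the record length, field offset, field value, and function names are all hypothetical. It compares rebuilding an immutable `bytes` record from slices against overwriting the same field inside a `bytearray`.

```python
import timeit

# Hypothetical record layout, chosen only for illustration.
RECORD_LEN = 48
OFFSET = 40
FIELD = b"\x01\x02\x03\x04"


def patch_with_slicing(record: bytes) -> bytes:
    """Build a new immutable record from the slices around the field."""
    return record[:OFFSET] + FIELD + record[OFFSET + len(FIELD):]


def patch_in_place(record: bytearray) -> bytearray:
    """Overwrite the field inside the existing buffer; same length, no new record."""
    record[OFFSET:OFFSET + len(FIELD)] = FIELD
    return record


if __name__ == "__main__":
    immutable = bytes(RECORD_LEN)
    mutable = bytearray(RECORD_LEN)
    runs = 200_000
    t_slice = timeit.timeit(lambda: patch_with_slicing(immutable), number=runs)
    t_place = timeit.timeit(lambda: patch_in_place(mutable), number=runs)
    print(f"slicing:  {t_slice:.3f}s")
    print(f"in place: {t_place:.3f}s")
```

The absolute numbers depend on the interpreter version and the machine, so measure on the target system rather than relying on any figure quoted here.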

  9. roelf says:

    Thanks for the reply Jerith. I appreciate the time you’ve taken to do some boilerplate code that is clearly well thought out in terms of the problem domain. I enjoy constructive arguments that are backed by some decent riposte and as such, I have to counter-riposte 😉

    I’m pretty sure I don’t know all about how Python handles immutable strings and their “copies” but I have a feeling it involves some overhead. Your basic benches have indicated as much over a single iteration.

    I will construct a basic acid test in Perl, to compare and let’s see how far the rabbit hole goes. Unfortunately for tonight, I’m dealing with month-end and billing runs so I’m a bit out of the game for the next 36 hours.

  10. Rossi says:

    string = "python sucks"

    string[7] = "r"
    string[8] = "o"

    wait.. that does not work..
    so python sucks .. sorry python

    • jerith says:

      foo = "perl sucks"
      foo = [ch for ch in "perl rocks"]

      Wait, that doesn’t work. Sorry, Perl, you suck.

      I can do that too, for any language you like. What does it prove?

  11. Rossi says:

    I’m sorry

    This: http://paste.ubuntu.com/573541/

    or: var[40] = 1, var[41] = 2, var[42] = 3, var[43] = 4;

    the latter being available in most languages…. and it does not need performance measurement as it is a pretty basic CPU instruction. SET THIS TO THAT!

    • jerith says:

      If you want C, you know where to find it. Of course, C strings can’t do stuff like this:

      >>> name = "jerith"; ("hello %s" % name).title().center(20)
      '    Hello Jerith    '

      A Python string is not an array of bytes and trying to treat it like one will end in tears.

  12. roelf says:

    Agreed. Set byte [x] to y. Should be bog standard. And it can be in Python. Except you have to know many “secret sauces”.

    • jerith says:

      It’s standard in any mutable sequence type. Including a string-like one that’s in the standard library.
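The mutable, string-like standard library type referred to here is presumably `bytearray`. A short sketch, assuming Python 3 and ASCII data, replays the earlier "python sucks" example and the `var[40] = 1` style assignment on one; the variable names are illustrative only.

```python
# str is immutable: item assignment raises TypeError.
text = "python sucks"
try:
    text[7] = "r"
except TypeError as exc:
    error = str(exc)

# bytearray is a mutable sequence of integers in range(256),
# so "set byte [x] to y" works directly.
buf = bytearray(b"python sucks")
buf[7] = ord("r")
buf[8] = ord("o")
result = buf.decode("ascii")
print(result)  # python rocks

# The same idea applied to a zero-filled 48-byte buffer (hypothetical size).
var = bytearray(48)
var[40], var[41], var[42], var[43] = 1, 2, 3, 4
```

The elements are integers rather than one-character strings, which is why the assignments go through `ord()`.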

  13. roelf says:

    Whilst we’re at it, let’s discuss ternary operators, right ? “yes” : “no”

    • jerith says:

      The syntax is a bit ugly, but it’s been there since at least 2.5:

      >>> "true" if True else "false"
      'true'
      >>> "true" if False else "false"
      'false'

      Of course, there’s the old fallback that uses boolean operators that has been around forever:

      >>> True and "true" or "false"
      'true'
      >>> False and "true" or "false"
      'false'
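One caveat about the boolean-operator fallback is worth spelling out: it only behaves like a ternary when the "true" branch is itself truthy. The sketch below uses hypothetical helper names (`pick` and `pick_legacy`) to put the two forms side by side.

```python
def pick(condition, if_true, if_false):
    """Conditional expression (Python 2.5 and later)."""
    return if_true if condition else if_false


def pick_legacy(condition, if_true, if_false):
    """Old and/or idiom; falls through to if_false when if_true is falsy."""
    return condition and if_true or if_false


print(pick(True, "yes", "no"))         # yes
print(pick_legacy(True, "yes", "no"))  # yes

# With a falsy "true" value (empty string, 0, empty list) the idiom breaks:
print(repr(pick(True, "", "no")))         # ''
print(repr(pick_legacy(True, "", "no")))  # 'no'
```

Note that wrapping the expressions in functions evaluates both branches up front; the inline forms only evaluate the branch that is selected.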

  14. jerith says:

    If every language had an identical feature set, there would be no point in having different languages. Python has immutable strings. Perl has native syntax for regular expressions. C has pointer arithmetic. Ruby has mutable classes. Java has formal standards out the wazoo. Erlang has incredibly cheap process creation. BASIC has line numbers.

    Try using Python for what it is rather than treating it as something it isn’t. Like any tool, there are things it isn’t really good at, and the Netflow example is probably one of them — it gets the job done, but maybe not as elegantly as a different language. If you don’t like the tradeoffs, use a language that makes different ones.