Profiling and speeding up Python


#1

Hi All,

I was recently doing some system performance testing (of Elasticsearch) and needed to generate some fake data, so I used the faker package. The code looks like this:

import timeit
import pytz
from faker import Faker


def fake_log():
    return {
        "ip": fake.ipv4(network=False),
        "userAgent": fake.user_agent(),
        "url": fake.url(),
        "uuid": fake.uuid4(),
        "ipv6": fake.ipv6(network=False),
        "location": '-25.%i, 56.%i' % (fake.pyint(), fake.pyint()),
        "created": fake.iso8601(tzinfo=pytz.timezone(fake.timezone())),
        "bytes": fake.pyint()
    }


def fixed_log():
    return {
        'ip': '184.252.12.203',
        'userAgent': 'Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3 like Mac OS X; en-US) AppleWebKit/535.34.1 (KHTML, like Gecko) Version/3.0.5 Mobile/8B118 Safari/6535.34.1',
        'url': 'http://www.martin-moran.com/',
        'uuid': '64b41337-6733-9355-c2a2-d71ea9699b25',
        'ipv6': '738c:71a4:14fc:c2d0:2c69:e78d:61de:2332',
        'location': '-25.661, 56.8805',
        'created': '2006-09-25T03:27:28+03:00',
        'bytes': 9352
    }


fake = Faker()
for _ in range(1000):
    fake_log()

The problem is that it’s enormously slow, which is why there’s a fixed_log() function in there as a stand-in. So, just for amusement, I started trying to track down the source of the slowness to see if something could be done about it. The only thing I’ve tried so far is running the standard Python profiler on it, which shows the following:

$ python -m cProfile -s time ./faker-test.py
         2406466 function calls (2395019 primitive calls) in 1.905 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1952    0.709    0.000    0.709    0.000 distribution.py:22(<listcomp>)
  1953952    0.276    0.000    0.276    0.000 distribution.py:10(cumsum)
     1952    0.210    0.000    1.201    0.001 distribution.py:17(choice_distribution)
    11276    0.192    0.000    1.420    0.000 __init__.py:98(random_element)
      247    0.039    0.000    0.074    0.000 tzfile.py:26(build_tzinfo)
    25275    0.024    0.000    0.036    0.000 random.py:222(_randbelow)
    14339    0.017    0.000    0.038    0.000 random.py:172(randrange)
    14624    0.013    0.000    0.013    0.000 tzinfo.py:27(memorized_datetime)
4000/1000    0.013    0.000    1.449    0.001 {method 'sub' of '_sre.SRE_Pattern' objects}
     1941    0.013    0.000    0.013    0.000 {built-in method posix.stat}
      840    0.011    0.000    0.011    0.000 {built-in method io.open}
     1000    0.011    0.000    1.791    0.002 faker-test.py:7(fake_log)

So, I see that the cumsum function in distribution.py is called 1,953,952 times inside my loop of 1000 calls. The function looks like this:

def cumsum(it):
    total = 0
    for x in it:
        total += x
        yield total

From here, I am not sure how to optimize it. Anyone have any ideas?

Austin


#2

What is the input to cumsum? A list? A numpy array? If it’s a numpy array, use the numpy version of cumsum.

Or use numba’s JIT… although it really shines when the inner loops are a bit more complex and you find yourself having to iterate manually over numpy arrays.
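For example, a minimal numba sketch of that cumsum loop might look like this (just an illustration, assuming numba is installed; cumsum_numba is an illustrative name, not anything in faker):

import numpy as np
from numba import njit


@njit
def cumsum_numba(arr):
    # same running-total loop, but compiled to machine code by numba
    out = np.empty_like(arr)
    total = 0.0
    for i in range(arr.shape[0]):
        total += arr[i]
        out[i] = total
    return out


u = np.random.randn(10000)
cumsum_numba(u)  # first call triggers JIT compilation; later calls run at C speed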


#3

I’m not sure since I’m not the author of cumsum(), but I will check it out.

I was thinking it might just be easiest to implement the function in C. Though I don’t think I’ve ever done a native extension like that. I hadn’t thought about using numpy.

Austin


#4

Assuming you’re on Python 3.6, itertools.accumulate is available (and written in C). But if you’re on 3.6 you can go one better and swap out choice_distribution for random.choices, which takes weights directly and uses itertools.accumulate under the hood. On my laptop it looks like about a 7x speedup:

In [13]: a = list(range(100))

In [14]: b = list(range(100))

In [15]: %timeit choice_distribution(a, b)
61.5 µs ± 943 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [16]: %timeit random.choices(a, weights=b)
8.37 µs ± 117 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
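In case it helps, here’s a rough sketch of what that swap might look like inside faker’s distribution.py (assuming choice_distribution takes a sequence of elements and a parallel sequence of weights, as in the timings above):

import random


def choice_distribution(a, p):
    # random.choices (Python 3.6+) builds the cumulative weights with
    # itertools.accumulate in C and bisects into them, so the Python-level
    # cumsum loop disappears entirely.
    # Note: this uses the stdlib's global random; faker would probably want
    # to route this through its own seeded Random instance.
    return random.choices(a, weights=p, k=1)[0]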

#5

just fyi:

import numpy as np
import timeit

def cumsum(it):
    total = 0
    for x in it:
        total += x
        yield total

r = np.random.RandomState(123)
u = r.randn(10000)

%timeit list(cumsum(u))
# 100 loops, best of 3: 2.26 ms per loop
%timeit u.cumsum()
# 10000 loops, best of 3: 30.7 µs per loop

numpy FTW…

I was thinking it might just be easiest to implement the function in C. Though I don’t think I’ve ever done a native extension like that.

Python extensions are dark evil magic fraught with peril; avoid them whenever possible. The numpy and numexpr and numba libraries will usually handle speed optimization for you, without having to drop down into evil-land.


#6

My big concern with using numpy is that it’s such a substantial dependency for a module that should be easy to install. I know the dependency-resolution story has gotten a lot better over the years, but there are still people out there using RHEL 6, and other similarly unfortunate souls. I’d be inclined to just follow the Python 3.6 route to begin with.

If I were to try using numpy, I’d make it an optional/recommended dependency and ensure the code degrades gracefully without it. Either way, both of your suggestions are good, and I’ll put together a pull request for the module maintainer.
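For illustration, a hedged sketch of that optional-dependency idea (note: faker’s real cumsum returns a generator; here both branches return a plain list to keep the interface uniform):

try:
    import numpy as np

    def cumsum(it):
        # fast vectorised path when numpy is available
        return np.cumsum(np.fromiter(it, dtype=float)).tolist()
except ImportError:
    def cumsum(it):
        # pure-Python fallback, equivalent to the existing generator
        total, out = 0.0, []
        for x in it:
            total += x
            out.append(total)
        return out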

Thanks for your feedback!

Austin


#7

Looks like this will happen when the choices carry probabilities (i.e. when random_element is handed a dict of weights). It was a bit of a rabbit hole to find which of your fakes was pulling from a dictionary; you should be able to find that more easily via your profiler’s print_callers().
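For example, something along these lines should show who is driving those calls (assuming fake_log and fake are defined as in the first post and this runs from the same script):

import cProfile
import pstats

cProfile.run('[fake_log() for _ in range(1000)]', 'faker-test.prof')
stats = pstats.Stats('faker-test.prof')
stats.sort_stats('cumulative').print_callers('choice_distribution')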

I came here to mention the new frame evaluation API in Python 3.6 (PEP 523). It might make for an interesting meetup topic and could have many applications. One could be a way to do application-level instrumentation with little overhead, controllable at run time.


#8

It seems likely that only a few distributions are being generated, but they are being generated over and over (and over) again. If that’s the case, memoizing the distributions seems likely to be both more effective and simpler than speeding up cumsum.

If you were a terrible person, you might even be able to just monkey patch a “fixed” version of choice_distribution (or wherever it makes sense to memoize) into distribution.py… Fortunately, I’m feeling too lazy this afternoon to corrupt the youth of today.

[edit]

So I lied, I went ahead and monkey patched this and managed to speed it up by about a factor of 10. Unfortunately, faker seems to be designed to be slow :-(, so I had to do some dubious stuff to get this approach to work.

The issue is that you can’t hash dictionaries, so to memoize you either need to hash by id (fast, but dangerous) or hash the keys and values converted to tuples or equivalent (slow, but safe). The fact that this key/value conversion is so expensive likely means that efforts to speed up cumsum won’t yield much gain.

Here’s the dubious monkey patching code for your enjoyment(?):

import faker.providers  # needed so the patching assignments at the bottom resolve
from faker.utils.distribution import cumsum, random_sample, bisect
from faker.providers import random

_choice_distribution_memo = {}


def choice_distribution(elements):
    # This is pretty dangerous: if the dictionary is modified without
    # changing its length, this will break ;-(
    key = (id(elements), len(elements))
    # Safe but slow:
    # key = tuple(tuple(x) for x in elements.items())

    if key not in _choice_distribution_memo:
        a = list(elements.keys())
        cdf = list(cumsum(elements[x] for x in a))
        normal = cdf[-1]
        cdf2 = [float(i) / float(normal) for i in cdf]
        _choice_distribution_memo[key] = a, cdf2

    a, cdf2 = _choice_distribution_memo[key]

    uniform_sample = random_sample()
    idx = bisect.bisect_right(cdf2, uniform_sample)

    return a[idx]

@classmethod
def random_element(cls, elements=('a', 'b', 'c')):
    """
    Returns a random element from a passed object.
    If `elements` is a dictionary, the value will be used as
    a weighting element. For example::
        random_element({"{{variable_1}}": 0.5, "{{variable_2}}": 0.2, "{{variable_3}}": 0.2, "{{variable_4}}": 0.1})
    will have the following distribution:
        * `variable_1`: 50% probability
        * `variable_2`: 20% probability
        * `variable_3`: 20% probability
        * `variable_4`: 10% probability
    """

    if isinstance(elements, dict):
        return choice_distribution(elements)
    else:
        if not isinstance(elements, (list, tuple)):
            elements = tuple(elements)
        return random.choice(elements)


faker.providers.BaseProvider.random_element = random_element
faker.providers.random_element = random_element
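If you want to sanity-check the speedup yourself, a rough harness might look like this (patched_faker is a hypothetical module holding the code above; absolute numbers will vary):

import timeit

setup_plain = "from faker import Faker; fake = Faker()"
setup_patched = setup_plain + "; import patched_faker"  # hypothetical module with the patch

# time one of the providers used in fake_log, before and after patching;
# the ratio is what matters, not the absolute numbers
print(timeit.timeit("fake.user_agent()", setup=setup_plain, number=1000))
print(timeit.timeit("fake.user_agent()", setup=setup_patched, number=1000))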

#9

Great suggestions, guys! So we have: numpy’s cumsum, random.choices / itertools.accumulate on Python 3.6+, and memoizing the distributions.

One thing I didn’t state in my original post is that my immediate problem is already solved, so I don’t need a temporary fix and won’t be doing any monkey patching; thanks for suggesting a dirty trick anyway, @bitsofbits. I’d like to come up with something that makes sense to include in the library in general.

Oh, I just tried memoization myself but hadn’t seen that @bitsofbits had already updated his post with a solution. I got bogged down memoizing a list and ended up converting the list to a tuple (like your commented-out approach). I’ll try to understand your edit. Thanks!
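For reference, a tuple-key memoization along those lines might look like this (an illustrative sketch using functools.lru_cache; the names don’t match faker’s actual internals):

import bisect
import random
from functools import lru_cache


@lru_cache(maxsize=None)
def _cdf_for(items):
    # items is a tuple of (element, weight) pairs, so it is hashable
    elements = [e for e, _ in items]
    total, cdf = 0.0, []
    for _, w in items:
        total += w
        cdf.append(total)
    return elements, [c / total for c in cdf]


def choice_distribution(weighted):
    # convert the dict to a tuple key (the slow-but-safe route)
    elements, cdf = _cdf_for(tuple(weighted.items()))
    return elements[bisect.bisect_right(cdf, random.random())]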