Multiprocessing with Selenium and Python for website crawling

Sven Schannak
2 min read · Jun 13, 2016


Selenium is a great tool for automating simple processes on web services that have no API at all, or none for your specific use case. Think of clicking a button to trigger an event on the website of your online shop. Our team at Gusti Leder uses Selenium to take control of our current ERP, because its API is not secure enough for our standards and is pretty slow.

In our case we pull information from the frontend of the ERP, for example from a list or from the detail page of an order. Selenium fills in a search form field, and once we have found the right order we pull the information from the page. This is much faster than the ERP's API, and we can use more filters than the API offers.
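To make the flow concrete, here is a minimal sketch of that idea. The URL, the form field name and the CSS selectors are placeholders for illustration, not our real ERP frontend.

```python
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://erp.example.com/orders")  # placeholder ERP frontend URL

# Fill the search form with an order number and submit it.
search_box = driver.find_element_by_name("q")  # placeholder field name
search_box.send_keys("ORDER-12345")
search_box.submit()

# On the detail page, pull the pieces of information we need.
customer = driver.find_element_by_css_selector(".customer-name").text
total = driver.find_element_by_css_selector(".order-total").text
print(customer, total)

driver.quit()
```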

But we can still try to make this process faster, and we got this optimization done quickly with Python's multiprocessing module and its Pool. You can find the basic knowledge you need about this right here:

Basically, you should not start more processes than you have cores on your processor. Ideally you use the number of your cores minus one, because you still need one core for your system processes. But don't worry, multiprocessing has a helper for you: cpu_count.
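A minimal sketch of that rule, leaving one core free for the system:

```python
from multiprocessing import cpu_count

# Use one process per core, minus one for the rest of the system.
max_processes = max(cpu_count() - 1, 1)
```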

After this, you have to split your input data between the processes. We can do this by splitting the list with the input information into as many smaller lists as we have processes (your maximum number of processes), as sketched below.
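One way to do that split; the chunk helper is our own illustration, not something multiprocessing provides:

```python
def chunk(items, n):
    """Split items into n lists of roughly equal size."""
    return [items[i::n] for i in range(n)]

orders = ["ORDER-1", "ORDER-2", "ORDER-3", "ORDER-4", "ORDER-5"]
parts = chunk(orders, 2)
# -> [['ORDER-1', 'ORDER-3', 'ORDER-5'], ['ORDER-2', 'ORDER-4']]
```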

Now you have everything you need to start the processes in a loop.
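Putting it together, a sketch under the assumptions above: crawl_orders is a hypothetical worker that gets one chunk of order numbers, and every process opens its own Selenium driver.

```python
from multiprocessing import Process, cpu_count
from selenium import webdriver

def crawl_orders(order_numbers):
    driver = webdriver.Firefox()  # every process needs its own browser
    for number in order_numbers:
        pass  # search for the order and scrape it, as sketched above
    driver.quit()

if __name__ == "__main__":
    orders = ["ORDER-1", "ORDER-2", "ORDER-3", "ORDER-4", "ORDER-5"]
    n = max(cpu_count() - 1, 1)
    chunks = [orders[i::n] for i in range(n)]

    processes = []
    for part in chunks:
        p = Process(target=crawl_orders, args=(part,))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()  # wait for all crawlers to finish
```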
