
For large data processing problems, my main tools are as follows:

Libs:
- Dask for distributed processing
- matplotlib/seaborn for graphing
- IPython/Jupyter for creating shareable data analyses

Environment:
- S3 for data warehousing; I mainly use Parquet files with pyarrow/fastparquet
- EC2 for the Dask cluster
- Ansible for EC2 setup
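A minimal sketch of how that setup hangs together, assuming a scheduler is already running on one of the EC2 instances (the address, bucket, and prefix are placeholders, not values from my actual environment):

```python
import dask.dataframe as dd
from dask.distributed import Client

# Attach to the Dask scheduler running on an EC2 instance
# (placeholder address for whatever Ansible provisioned)
client = Client("tcp://10.0.0.1:8786")

# Read a Parquet dataset straight from S3 with the pyarrow engine
df = dd.read_parquet("s3://my-bucket/events/", engine="pyarrow")
print(df.head())
```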

My problems can usually be solved with two memory-heavy EC2 instances. This setup works really well for me. Reading and writing intermediate results to S3 is blazing fast, especially if you partition the data by day when working with time series.
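For the day-partitioned intermediate results, the write step looks roughly like this (the column and path names are illustrative):

```python
import dask.dataframe as dd

df = dd.read_parquet("s3://my-bucket/events/", engine="pyarrow")

# Derive a day key from the event timestamp so Parquet can partition on it
df["day"] = df["timestamp"].dt.date.astype(str)

# One directory per day on S3, so later reads can select only the days needed
df.to_parquet(
    "s3://my-bucket/intermediate/events_by_day/",
    engine="pyarrow",
    partition_on=["day"],
    write_index=False,
)
```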

Many of the harder problems require custom mapping functions. I usually apply them with dask.dataframe.map_partitions, which is still extremely fast.
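A hedged example of that pattern, with a made-up per-partition function that computes the time gap between consecutive events:

```python
import pandas as pd
import dask.dataframe as dd

def add_gap_seconds(part: pd.DataFrame) -> pd.DataFrame:
    # Runs on a plain pandas DataFrame, one partition at a time
    part = part.sort_values("timestamp")
    part["gap_s"] = part["timestamp"].diff().dt.total_seconds()
    return part

df = dd.read_parquet("s3://my-bucket/intermediate/events_by_day/", engine="pyarrow")

# Dask infers the output schema by running the function on an empty partition;
# pass meta= explicitly if that inference is wrong or slow
with_gaps = df.map_partitions(add_gap_seconds)
```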

The most time-consuming operation is usually unique/nunique counting across large time series. For this, Dask offers HyperLogLog-based approximations.
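Dask exposes this as nunique_approx() on a series; the column name here is just an example:

```python
import dask.dataframe as dd

df = dd.read_parquet("s3://my-bucket/events/", engine="pyarrow")

# Exact distinct count: needs a full aggregation over every value
exact = df["user_id"].nunique()

# HyperLogLog estimate: much cheaper, typically within a few percent
approx = df["user_id"].nunique_approx()

print(exact.compute(), approx.compute())
```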

To sum it up, Dask alone makes all the difference for me!





