One-day workshop on big data using PySpark

Feb 3rd, 2017

9:00 am - 12:00 pm

Instructors: Anne Fouilloux

Helpers: Hugues Fontenelle

General Information

Software Carpentry aims to help researchers get their work done in less time and with less pain by teaching them basic research computing skills. This hands-on workshop will cover basic concepts and tools, including program design, version control, data management, and task automation. Participants will be encouraged to help one another and to apply what they have learned to their own research problems.

For more information on what we teach and why, please see our paper "Best Practices for Scientific Computing".

Who: The course is aimed at graduate students and other researchers. This one-day Carpentry@UiO hands-on workshop will give a short introduction to big data analysis using PySpark. The Spark Python API (PySpark) exposes the Spark programming model to Python. Apache® Spark™ is open source and is one of the most popular big data frameworks for scaling up your tasks on a cluster. It was developed to use distributed, in-memory data structures to improve data processing speeds. A basic knowledge of Python is recommended, but you don't need any previous knowledge of big data analysis or Apache Spark.

Where: FIXME. Get directions with OpenStreetMap or Google Maps.

Requirements: Participants must bring a laptop with a Mac, Linux, or Windows operating system (not a tablet, Chromebook, etc.) that they have administrative privileges on. They should have a few specific software packages installed (listed below). They are also required to abide by Software Carpentry's Code of Conduct.

Accessibility: We are committed to making this workshop accessible to everybody. The workshop organisers have checked that:

Materials will be provided in advance of the workshop, and large-print handouts are available if you notify the organizers in advance. If we can help make learning easier for you (e.g. sign-language interpreters, lactation facilities), please get in touch and we will attempt to provide them.

Contact: Please email contact-us@swcarpentry.uio.no for more information.


Schedule

Day 1

09:00 Introduction to Big data
10:00 MapReduce Programming Paradigm
10:30 Coffee
11:00 MapReduce Programming Paradigm
12:00 Wrap-up

Etherpad: http://pad.software-carpentry.org/2017-02-03-pyspark.
We will use this Etherpad for chatting, taking notes, and sharing URLs and bits of code.


Syllabus

Our lesson on Big-data using PySpark can be found here.

Setup

To ease our work and avoid installing Spark on your laptop, we will be using the UiO Galaxy eduPortal. If you haven't received a login and password yet, don't panic. This can be handled in a few minutes during the workshop.
For the workshop you will need a web browser (Firefox, Google Chrome or Internet Explorer) and you must be able to establish a wireless internet connection. For more information on how to connect to the wireless network at UiO, see "Connect to UIO wireless". If you are not affiliated with the University of Oslo and do not have an eduroam account, you can still use our guest Wi-Fi network. See detailed instructions here.

Remark: without changing your PySpark code, you will be able to scale it up to hundreds of processors on any cluster or HPC system. At the University of Oslo, you may use the UiO HPC cluster Abel. See detailed information here.
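As a rough illustration of this remark, here is a minimal sketch (not taken from the lesson material) of a PySpark job that never names a master in the code. The function names `square`, `sum_of_squares` and `main`, as well as the application name, are made up for this example; the Spark calls used are `SparkContext`, `parallelize`, `map` and `reduce`.

```python
from operator import add


def square(x):
    """Function applied to every element in the map step."""
    return x * x


def sum_of_squares(sc, numbers):
    """Sum the squares of `numbers` using a SparkContext `sc`."""
    return (
        sc.parallelize(numbers)  # distribute the input as an RDD
        .map(square)             # map step: transform each element
        .reduce(add)             # reduce step: combine all results
    )


def main():
    # Imported here so that the helpers above can be read and tested
    # on a machine where PySpark is not installed.
    from pyspark import SparkContext

    # Assumption for this sketch: no master URL is set in the code.
    # It is chosen when the job is launched, so the same file can run
    # on a laptop or on a cluster.
    sc = SparkContext(appName="sum-of-squares-sketch")
    try:
        print(sum_of_squares(sc, range(1000)))
    finally:
        sc.stop()
```

Calling `main()` requires a working PySpark installation. Because the master is left out of the code, where the job runs is decided at launch time (for example through the `--master` option of `spark-submit`) rather than by editing the script.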

Installing Python on your laptop is not mandatory, but the first part of the lesson uses pure Python, so we suggest that you install it. To participate in a Software Carpentry workshop, you will need access to the software described below. In addition, you will need an up-to-date web browser.
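To give an idea of what a pure-Python exercise on the MapReduce paradigm might look like, here is a small word-count sketch that uses only the standard library. It is an illustration written for this page, not an excerpt from the lesson; the function names `map_phase`, `shuffle_phase` and `reduce_phase` and the sample sentences are assumptions.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    return [(word, 1) for line in lines for word in line.lower().split()]


def shuffle_phase(pairs):
    """Group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values of each key into a single count."""
    return {
        key: reduce(lambda a, b: a + b, values)
        for key, values in groups.items()
    }


lines = ["big data is big", "data is everywhere"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)
```

The three functions mirror the usual map, shuffle and reduce stages. In a framework such as Spark the grouping stage is handled for you and the work is spread over several machines.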

Python

Python is a popular language for research computing, and great for general-purpose programming as well. Installing all of its research packages individually can be a bit difficult, so we recommend Anaconda, an all-in-one installer.

Regardless of how you choose to install it, please make sure you install Python version 3.x (e.g., 3.4 is fine).
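If you are unsure which version you have, a short check like the following can be run with your Python interpreter. It is only a sketch; the helper name `python_is_supported` is made up for this example.

```python
import sys


def python_is_supported(version_info=sys.version_info):
    """Return True when the interpreter is Python 3.x."""
    return version_info[0] == 3


# Show the version of the interpreter running this script.
print("Python", sys.version.split()[0])

if python_is_supported():
    print("This interpreter is Python 3.x.")
else:
    print("Please install Python 3.x before the workshop.")
```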

We will teach Python using the IPython notebook, a programming environment that runs in a web browser. For this to work you will need a reasonably up-to-date browser. The current versions of the Chrome, Safari and Firefox browsers are all supported (some older browsers, including Internet Explorer version 9 and below, are not).

Windows

Video Tutorial
  1. Open http://continuum.io/downloads with your web browser.
  2. Download the Python 3 installer for Windows.
  3. Install Python 3 using all of the defaults for installation, except make sure to check "Make Anaconda the default Python".

Mac OS X

Video Tutorial
  1. Open http://continuum.io/downloads with your web browser.
  2. Download the Python 3 installer for OS X.
  3. Install Python 3 using all of the defaults for installation.

Linux

  1. Open http://continuum.io/downloads with your web browser.
  2. Download the Python 3 installer for Linux.
  3. Install Python 3 using all of the defaults for installation. (Installation requires using the shell. If you aren't comfortable doing the installation yourself, stop here and request help at the workshop.)
  4. Open a terminal window.
  5. Type
    bash Anaconda3-
    and then press tab. The name of the file you just downloaded should appear.
  6. Press enter. You will follow the text-only prompts. When there is a colon at the bottom of the screen press the down arrow to move down through the text. Type yes and press enter to approve the license. Press enter to approve the default location for the files. Type yes and press enter to prepend Anaconda to your PATH (this makes the Anaconda distribution the default Python).