uber example

You downloaded a zip file at the beginning of the workshop. Time to use it!

First, navigate to the Downloads folder.

We need to unzip the zip file -- any guesses about what command unzips files?

unzip uber-trip-data-master.zip

By default, unzip will create a folder with the same name as the zip file (without the .zip part). If you want to unzip the file to a different location, you can use the -d flag.

For example:

unzip uber-trip-data-master.zip -d ~/uber-trip-data

Now you should have a folder called uber-trip-data or something similar, depending on where and how you unzipped it. cd to that directory and take a look.

$ cd uber-trip-data/
$ ls
taxi-zone-lookup.csv     uber-raw-data-janjune-15.csv.zip  uber-raw-data-may14.csv
uber-raw-data-apr14.csv  uber-raw-data-jul14.csv           uber-raw-data-sep14.csv
uber-raw-data-aug14.csv  uber-raw-data-jun14.csv

We have several raw data files (so named). csv files divided by month and also a zip file. We'll leave the zip file alone for the moment. What questions can we answer about this dataset from the command line?

First, it would help to know what these files contain. We can use head to display the first ten lines of one of the raw files.

$ head uber-raw-data-apr14.csv
"4/1/2014 0:11:00",40.769,-73.9549,"B02512"
"4/1/2014 0:17:00",40.7267,-74.0345,"B02512"
"4/1/2014 0:21:00",40.7316,-73.9873,"B02512"
"4/1/2014 0:28:00",40.7588,-73.9776,"B02512"
"4/1/2014 0:33:00",40.7594,-73.9722,"B02512"
"4/1/2014 0:33:00",40.7383,-74.0403,"B02512"
"4/1/2014 0:39:00",40.7223,-73.9887,"B02512"
"4/1/2014 0:45:00",40.762,-73.979,"B02512"
"4/1/2014 0:55:00",40.7524,-73.996,"B02512"

They weren't kidding when they said "raw" data. What do we have here? Looks like a date and time (these are pickup times), the lat-long of the pickup location and some "Base" id that we don't care about right now.

What can we take away from this?

For one, that the number of lines in this file is how many pickups Uber made during the month of April, 2014. That sounds like interesting information. We can use wc or wordcount to count the number of lines in a file by using the -l flag.

$ wc -l uber-raw-data-apr14.csv
564517 uber-raw-data-apr14.csv

We know now that Uber provided more than half a million rides in New York City in April of 2014. Let's take a look at how that figure changes month-to-month!

We can run the same wc command but now use the * wildcard to get the linecount of every file in the directory.

$ wc -l *
        0 taxi-zone-lookup.csv
   564517 uber-raw-data-apr14.csv
   829276 uber-raw-data-aug14.csv
   284595 uber-raw-data-janjune-15.csv.zip
   796122 uber-raw-data-jul14.csv
   663845 uber-raw-data-jun14.csv
   652436 uber-raw-data-may14.csv
  1028137 uber-raw-data-sep14.csv
  4818928 total

Cool, but the linecount of a zip file doesn't really make any sense. To be a little more specific, restrict wc to only look at csv files.

$ wc -l *.csv
        0 taxi-zone-lookup.csv
   564517 uber-raw-data-apr14.csv
   829276 uber-raw-data-aug14.csv
   796122 uber-raw-data-jul14.csv
   663845 uber-raw-data-jun14.csv
   652436 uber-raw-data-may14.csv
  1028137 uber-raw-data-sep14.csv
  4534333 total

Better. wc outputs files in the order it gets them, which in this case is the order they exist in the directory. And that's alphabetical order. The names of months aren't hugely useful when sorting alphabetically.

Let's use the sort command to sort the results from wc. We can pipe the output of wc to sort using the | character. Remember, the pipe takes the output from the previous command and hands it off to the following command.

$ wc -l *.csv | sort
        0 taxi-zone-lookup.csv
  1028137 uber-raw-data-sep14.csv
  4534333 total
   564517 uber-raw-data-apr14.csv
   652436 uber-raw-data-may14.csv
   663845 uber-raw-data-jun14.csv
   796122 uber-raw-data-jul14.csv
   829276 uber-raw-data-aug14.csv

Hmmm. That looks a little funny. Can you see what sort did?

Yeah, it sorted things alphanumerically, which isn't helpful since it only looks at leading digits. We want to use the number of lines in each file as the sorting criteria.

To do this, we can use the -n flag with sort to specify a "numerical" sort.

This may start to look a little confusing, but remember, we're just building up a command using smaller commands. We use the -l flag with wc to count the number of lines, then pipe that output to sort where we use the -n flag to require numerical sorting.

$ wc -l *.csv | sort -n
        0 taxi-zone-lookup.csv
   564517 uber-raw-data-apr14.csv
   652436 uber-raw-data-may14.csv
   663845 uber-raw-data-jun14.csv
   796122 uber-raw-data-jul14.csv
   829276 uber-raw-data-aug14.csv
  1028137 uber-raw-data-sep14.csv
  4534333 total

Ok! Now we have the raw data files sorted by number of lines, which we know is equivalent to number of rides. And look, now the months are in the correct order. That wasn't necessarily expected, but looking at this output we can see that Uber is expanding at a pretty fast pace in 2014; they nearly doubled their usage numbers in 5 months!

This data set is missing the last quarter of 2014, but we have the first half of 2015 available, so we can check if the trend continues (is there a ceiling for Uber requests in NYC?)

First, unzip the file containing the 2015 data.

$ unzip uber-raw-data-janjune-15.csv.zip
Archive:  uber-raw-data-janjune-15.csv.zip
  inflating: uber-raw-data-janjune-15.csv
   creating: __MACOSX/
  inflating: __MACOSX/._uber-raw-data-janjune-15.csv

Now we know that whoever created this zip file uses a Mac. But that's not really important. Let's take another look at the line counts.

$ wc -l *.csv | sort -n
        0 taxi-zone-lookup.csv
   564517 uber-raw-data-apr14.csv
   652436 uber-raw-data-may14.csv
   663845 uber-raw-data-jun14.csv
   796122 uber-raw-data-jul14.csv
   829276 uber-raw-data-aug14.csv
  1028137 uber-raw-data-sep14.csv
 14270480 uber-raw-data-janjune-15.csv
 18804813 total

Wow! 14+ million rides! Impressive! But this data layout is different from the 2014 data. The six months are all in the same file. Not cool. So what now?

First, let's see what the data looks like in the combined file.

$ head uber-raw-data-janjune-15.csv
B02617,2015-05-17 09:47:00,B02617,141
B02617,2015-05-17 09:47:00,B02617,65
B02617,2015-05-17 09:47:00,B02617,100
B02617,2015-05-17 09:47:00,B02774,80
B02617,2015-05-17 09:47:00,B02617,90
B02617,2015-05-17 09:47:00,B02617,228
B02617,2015-05-17 09:47:00,B02617,7
B02617,2015-05-17 09:47:00,B02764,74
B02617,2015-05-17 09:47:00,B02617,249

Very uncool. First, the data is in a different format than the previous files we looked at. Worse, the first pickup listed is in May? Either the file is mislabeled (bad) or it isn't even sorted (bad).

Let's look at a few more lines to see if we can figure out which bad scenario we have.

We can use tail to peek at the last 10 lines in the file. How do those look?

$ tail uber-raw-data-janjune-15.csv
B02765,2015-05-08 15:42:00,B02764,79
B02765,2015-05-08 15:42:00,B02765,37
B02765,2015-05-08 15:42:00,B02765,161
B02765,2015-05-08 15:42:00,B02765,7
B02765,2015-05-08 15:43:00,B02711,25
B02765,2015-05-08 15:43:00,B02765,186
B02765,2015-05-08 15:43:00,B02765,263
B02765,2015-05-08 15:43:00,B02765,90
B02765,2015-05-08 15:44:00,B01899,45
B02765,2015-05-08 15:44:00,B02682,144

Not looking good. Is this all just in May? Let's look through a larger number of lines using head and see if we can find a ride that wasn't in May. Use the -n flag with head to specify the number of lines to show (the default is 10).

$ head -n 500 uber-raw-data-janjune-15.csv
B02598,2015-01-18 11:06:58,B02598,7
B02598,2015-01-18 18:55:46,B02598,141
B02598,2015-01-18 14:54:28,B02598,249
B02598,2015-01-18 20:48:57,B02598,90
B02598,2015-01-18 09:28:20,B02682,234
B02598,2015-01-18 19:31:14,B02764,13
B02598,2015-01-18 14:13:38,B02598,163
B02598,2015-01-18 22:53:57,B02598,90
B02598,2015-01-18 19:13:00,B02617,246
B02598,2015-01-18 14:53:36,B02598,161
B02598,2015-01-18 02:37:00,B02598,114
B02598,2015-01-18 18:47:01,B02598,113
B02598,2015-01-18 16:06:11,B02598,233
B02598,2015-01-18 15:36:12,B02598,162
B02598,2015-01-18 02:10:39,B02598,50
B02598,2015-01-18 12:18:57,B02764,142
B02598,2015-01-18 14:03:01,B02598,37

Ok. 500 lines in, we can see some January pickups. It looks like we have bad option #2. The data is labeled correctly but isn't sorted. Time to sort it!

Sorting can be expensive, so rather than trying to sort the whole file at once, let's copy a portion of the big file into a separate file.

Use the same head command we just used, but now, instead of writing it to the screen, we can redirect that output to another file using >. We'll just call that file test.csv.

$ head -n 500 uber-raw-data-janjune-15.csv > test.csv

Now it's time to figure out how to sort this data. We can use sort the way we did with wc because the information we want to use as the sort key (the date and time) are embedded in the middle of every line.

Here's one line from test.csv:

B02598,2015-01-18 14:03:01,B02598,37

We already looked at using sort with fields and the -k flag. Let's try it here:

We're going to cat all of test.csv, pipe that into sort and then use the -k2 flag, which will sort the lines of test.csv based on the first character of the second field/column.

$ cat test.csv | sort -k2 | less

That... didn't work. The default field delimiter in sort is whitespace, so the previous command sorted everything based on pickup time, but ignored pickup date.

How can we change the delimiter character that sort uses? Let's check the man page.

$ man sort
$ cat test.csv | sort -t "," -k2 | less
$ cat test.csv | sort -t "," -k2 > test_sort.csv
$ head test_sort.csv
B02598,2015-01-18 00:02:54,,144

B02598,2015-01-18 00:05:05,B02598,50

B02598,2015-01-18 00:06:19,B02598,107

B02598,2015-01-18 00:08:14,B02598,142

B02598,2015-01-18 00:16:58,B02598,107

B02598,2015-01-18 00:30:59,B02598,50

B02598,2015-01-18 00:36:16,B02598,211

B02598,2015-01-18 00:37:16,B02774,141

B02598,2015-01-18 00:45:16,,48

B02598,2015-01-18 00:47:08,B02617,68