Uber example
You downloaded a zip file at the beginning of the workshop. Time to use it!
First, navigate to the Downloads folder.
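If your browser put the download in the usual place, something like this should get you there (adjust the path if your zip file landed somewhere else):
$ cd ~/Downloads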
We need to unzip the zip file -- any guesses about what command unzips files?
unzip uber-trip-data-master.zip
By default, unzip extracts into the current directory; this archive holds a single top-level folder, so you'll end up with a new folder named after the zip file (without the .zip part). If you want to unzip the file to a different location, you can use the -d flag.
For example:
unzip uber-trip-data-master.zip -d ~/uber-trip-data
Now you should have a folder called uber-trip-data or something similar, depending on where and how you unzipped it. cd to that directory and take a look.
$ cd uber-trip-data/
$ ls
taxi-zone-lookup.csv uber-raw-data-janjune-15.csv.zip uber-raw-data-may14.csv
uber-raw-data-apr14.csv uber-raw-data-jul14.csv uber-raw-data-sep14.csv
uber-raw-data-aug14.csv uber-raw-data-jun14.csv
We have several raw data files (so named): csv files divided by month, plus another zip file. We'll leave the zip file alone for the moment. What questions can we answer about this dataset from the command line?
First, it would help to know what these files contain. We can use head to display the first ten lines of one of the raw files.
$ head uber-raw-data-apr14.csv
"Date/Time","Lat","Lon","Base"
"4/1/2014 0:11:00",40.769,-73.9549,"B02512"
"4/1/2014 0:17:00",40.7267,-74.0345,"B02512"
"4/1/2014 0:21:00",40.7316,-73.9873,"B02512"
"4/1/2014 0:28:00",40.7588,-73.9776,"B02512"
"4/1/2014 0:33:00",40.7594,-73.9722,"B02512"
"4/1/2014 0:33:00",40.7383,-74.0403,"B02512"
"4/1/2014 0:39:00",40.7223,-73.9887,"B02512"
"4/1/2014 0:45:00",40.762,-73.979,"B02512"
"4/1/2014 0:55:00",40.7524,-73.996,"B02512"
They weren't kidding when they said "raw" data. What do we have here? Looks like a date and time (these are pickup times), the lat-long of the pickup location and some "Base" id that we don't care about right now.
What can we take away from this?
For one, the number of lines in this file (give or take the header row) is the number of pickups Uber made during April 2014. That sounds like interesting information. We can use wc (short for "word count") to count the number of lines in a file by using the -l flag.
$ wc -l uber-raw-data-apr14.csv
564517 uber-raw-data-apr14.csv
We know now that Uber provided more than half a million rides in New York City in April of 2014. Let's take a look at how that figure changes month-to-month!
We can run the same wc command but now use the * wildcard to get the line count of every file in the directory.
$ wc -l *
0 taxi-zone-lookup.csv
564517 uber-raw-data-apr14.csv
829276 uber-raw-data-aug14.csv
284595 uber-raw-data-janjune-15.csv.zip
796122 uber-raw-data-jul14.csv
663845 uber-raw-data-jun14.csv
652436 uber-raw-data-may14.csv
1028137 uber-raw-data-sep14.csv
4818928 total
Cool, but the line count of a zip file doesn't really make any sense. To be a little more specific, restrict wc to only look at csv files.
$ wc -l *.csv
0 taxi-zone-lookup.csv
564517 uber-raw-data-apr14.csv
829276 uber-raw-data-aug14.csv
796122 uber-raw-data-jul14.csv
663845 uber-raw-data-jun14.csv
652436 uber-raw-data-may14.csv
1028137 uber-raw-data-sep14.csv
4534333 total
Better. wc outputs files in the order it gets them, which in this case is the order they exist in the directory, and that's alphabetical order. The names of months aren't hugely useful when sorting alphabetically.
Let's use the sort command to sort the results from wc. We can pipe the output of wc to sort using the | character. Remember, the pipe takes the output from the previous command and hands it off to the following command.
$ wc -l *.csv | sort
0 taxi-zone-lookup.csv
1028137 uber-raw-data-sep14.csv
4534333 total
564517 uber-raw-data-apr14.csv
652436 uber-raw-data-may14.csv
663845 uber-raw-data-jun14.csv
796122 uber-raw-data-jul14.csv
829276 uber-raw-data-aug14.csv
Hmmm. That looks a little funny. Can you see what sort did?
Yeah, it sorted the lines lexicographically, comparing them character by character the way you'd alphabetize words, so 1028137 comes before 564517 because "1" comes before "5". We want to use the number of lines in each file, treated as a number, as the sorting criterion.
To do this, we can use the -n flag with sort to specify a "numerical" sort.
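To see the difference on a tiny made-up input (the two throwaway lines below are just for illustration), compare:
$ printf '10\n9\n' | sort
10
9
$ printf '10\n9\n' | sort -n
9
10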
This may start to look a little confusing, but remember, we're just building up a command using smaller commands. We use the -l flag with wc to count the number of lines, then pipe that output to sort, where we use the -n flag to require numerical sorting.
$ wc -l *.csv | sort -n
0 taxi-zone-lookup.csv
564517 uber-raw-data-apr14.csv
652436 uber-raw-data-may14.csv
663845 uber-raw-data-jun14.csv
796122 uber-raw-data-jul14.csv
829276 uber-raw-data-aug14.csv
1028137 uber-raw-data-sep14.csv
4534333 total
Ok! Now we have the raw data files sorted by number of lines, which we know is equivalent to number of rides. And look, now the months are in the correct order. That wasn't necessarily expected, but looking at this output we can see that Uber is expanding at a pretty fast pace in 2014; they nearly doubled their usage numbers in 5 months!
This data set is missing the last quarter of 2014, but we have the first half of 2015 available, so we can check whether the trend continues (is there a ceiling for Uber requests in NYC?).
First, unzip the file containing the 2015 data.
$ unzip uber-raw-data-janjune-15.csv.zip
Archive: uber-raw-data-janjune-15.csv.zip
inflating: uber-raw-data-janjune-15.csv
creating: __MACOSX/
inflating: __MACOSX/._uber-raw-data-janjune-15.csv
Now we know that whoever created this zip file uses a Mac. But that's not really important. Let's take another look at the line counts.
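As an aside, that stray __MACOSX folder only holds macOS metadata; nothing below depends on it, so it's safe to delete if it bothers you:
$ rm -r __MACOSX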
$ wc -l *.csv | sort -n
0 taxi-zone-lookup.csv
564517 uber-raw-data-apr14.csv
652436 uber-raw-data-may14.csv
663845 uber-raw-data-jun14.csv
796122 uber-raw-data-jul14.csv
829276 uber-raw-data-aug14.csv
1028137 uber-raw-data-sep14.csv
14270480 uber-raw-data-janjune-15.csv
18804813 total
Wow! 14+ million rides! Impressive! But this data layout is different from the 2014 data. The six months are all in the same file. Not cool. So what now?
First, let's see what the data looks like in the combined file.
$ head uber-raw-data-janjune-15.csv
Dispatching_base_num,Pickup_date,Affiliated_base_num,locationID
B02617,2015-05-17 09:47:00,B02617,141
B02617,2015-05-17 09:47:00,B02617,65
B02617,2015-05-17 09:47:00,B02617,100
B02617,2015-05-17 09:47:00,B02774,80
B02617,2015-05-17 09:47:00,B02617,90
B02617,2015-05-17 09:47:00,B02617,228
B02617,2015-05-17 09:47:00,B02617,7
B02617,2015-05-17 09:47:00,B02764,74
B02617,2015-05-17 09:47:00,B02617,249
Very uncool. First, the data is in a different format than the previous files we looked at. Worse, the first pickup listed is in May? Either the file is mislabeled (bad) or it isn't even sorted (bad).
Let's look at a few more lines to see if we can figure out which bad scenario we have.
We can use tail to peek at the last 10 lines in the file. How do those look?
$ tail uber-raw-data-janjune-15.csv
B02765,2015-05-08 15:42:00,B02764,79
B02765,2015-05-08 15:42:00,B02765,37
B02765,2015-05-08 15:42:00,B02765,161
B02765,2015-05-08 15:42:00,B02765,7
B02765,2015-05-08 15:43:00,B02711,25
B02765,2015-05-08 15:43:00,B02765,186
B02765,2015-05-08 15:43:00,B02765,263
B02765,2015-05-08 15:43:00,B02765,90
B02765,2015-05-08 15:44:00,B01899,45
B02765,2015-05-08 15:44:00,B02682,144
Not looking good. Is this all just in May? Let's look through a larger number of lines using head and see if we can find a ride that wasn't in May. Use the -n flag with head to specify the number of lines to show (the default is 10).
$ head -n 500 uber-raw-data-janjune-15.csv
[snip]
B02598,2015-01-18 11:06:58,B02598,7
B02598,2015-01-18 18:55:46,B02598,141
B02598,2015-01-18 14:54:28,B02598,249
B02598,2015-01-18 20:48:57,B02598,90
B02598,2015-01-18 09:28:20,B02682,234
B02598,2015-01-18 19:31:14,B02764,13
B02598,2015-01-18 14:13:38,B02598,163
B02598,2015-01-18 22:53:57,B02598,90
B02598,2015-01-18 19:13:00,B02617,246
B02598,2015-01-18 14:53:36,B02598,161
B02598,2015-01-18 02:37:00,B02598,114
B02598,2015-01-18 18:47:01,B02598,113
B02598,2015-01-18 16:06:11,B02598,233
B02598,2015-01-18 15:36:12,B02598,162
B02598,2015-01-18 02:10:39,B02598,50
B02598,2015-01-18 12:18:57,B02764,142
B02598,2015-01-18 14:03:01,B02598,37
Ok. 500 lines in, we can see some January pickups. It looks like we have bad option #2. The data is labeled correctly but isn't sorted. Time to sort it!
Sorting can be expensive, so rather than trying to sort the whole file at once, let's copy a portion of the big file into a separate file.
Use the same head command we just used, but now, instead of writing it to the screen, we can redirect that output to another file using >. We'll just call that file test.csv.
$ head -n 500 uber-raw-data-janjune-15.csv > test.csv
Now it's time to figure out how to sort this data. We can't use sort the way we did with wc, because the information we want to use as the sort key (the date and time) is embedded in the middle of every line.
Here's one line from test.csv:
B02598,2015-01-18 14:03:01,B02598,37
We already looked at using sort with fields and the -k flag. Let's try it here:
We're going to cat all of test.csv, pipe that into sort, and then use the -k2 flag, which will sort the lines of test.csv based on the second field/column (and everything after it).
$ cat test.csv | sort -k2 | less
That... didn't work. The default field delimiter in sort is whitespace, so each line got split at the space in the middle of the timestamp: the first field is B02598,2015-01-18 and the second is 14:03:01,B02598,37. The previous command therefore sorted on the pickup time but ignored the pickup date.
How can we change the delimiter character that sort uses? Let's check the man page.
$ man sort
The man page shows a -t flag that sets the field separator. Our data is comma-separated, so let's tell sort to split on commas; now the second field is the full pickup date and time.
$ cat test.csv | sort -t "," -k2 | less
That looks right, so save the sorted output to a file and spot-check it.
$ cat test.csv | sort -t "," -k2 > test_sort.csv
$ head test_sort.csv
B02598,2015-01-18 00:02:54,,144
B02598,2015-01-18 00:05:05,B02598,50
B02598,2015-01-18 00:06:19,B02598,107
B02598,2015-01-18 00:08:14,B02598,142
B02598,2015-01-18 00:16:58,B02598,107
B02598,2015-01-18 00:30:59,B02598,50
B02598,2015-01-18 00:36:16,B02598,211
B02598,2015-01-18 00:37:16,B02774,141
B02598,2015-01-18 00:45:16,,48
B02598,2015-01-18 00:47:08,B02617,68
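Looks sorted. If you want to keep going, the same command should work on the full 2015 file (the output filename here is just a suggestion). Fair warning: sorting 14 million lines can take a little while, and the header line gets sorted right along with the data.
$ sort -t "," -k2 uber-raw-data-janjune-15.csv > uber-raw-data-janjune-15-sorted.csv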