Project 3.1 Files and Databases
Goal: compare the efficiency of three kinds of queries:
- Query with .csv files using `grep` and `awk`
- Query with MySQL
- Query with HBase
Part 1. File query (skipped)
Part 2. MySQL
Configure database
Install MySQL using

```
sudo apt-get install mysql-server
```

Connect as the root user, typing your password right after `-p`:

```
mysql -u root -ppassword
SHOW DATABASES;
```

Create a new user:

```
CREATE USER 'newuser'@'localhost' IDENTIFIED BY 'team';
```

Grant partial or all privileges and log in again; here we grant all privileges:

```
GRANT ALL PRIVILEGES ON teamproj.* TO 'newuser'@'localhost';
```

Create a database and table which support full Unicode:

```
CREATE DATABASE blah;
```
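The statement above only names the database. As a hedged sketch of what a full-Unicode setup could look like, here is the `teamproj` database referenced in the GRANT above together with a hypothetical `songs` table whose columns mirror the CSV fields imported later; the actual schema used in the project is not shown in these notes:

```
-- Hypothetical schema; utf8mb4 is MySQL's full-Unicode character set.
CREATE DATABASE teamproj CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
USE teamproj;
CREATE TABLE songs (
    title               VARCHAR(255),
    song_id             VARCHAR(32),
    `release`           VARCHAR(255),   -- reserved word, so backquoted here
    artist_id           VARCHAR(32),
    artist_mbid         VARCHAR(40),
    artist_name         VARCHAR(255),
    duration            FLOAT,
    artist_familiarity  FLOAT,
    artist_hotttnesss   FLOAT,
    year                INT
) DEFAULT CHARSET=utf8mb4;
```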
Answer questions
Q8 & Q9
Use an index to dramatically decrease query time (see the sketch below).
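A minimal sketch of adding an index, assuming the hypothetical `songs` table above and a query that filters on `artist_name`; the actual Q8/Q9 queries are not reproduced here:

```
-- An index on the filtered column lets MySQL avoid a full table scan.
CREATE INDEX idx_artist_name ON songs (artist_name);

-- EXPLAIN shows whether the index is actually used by the query.
EXPLAIN SELECT title FROM songs WHERE artist_name = 'SomeArtist';
```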
Q10
- Here, we use `LIKE` for a case-insensitive pattern match; use `LIKE BINARY` for a case-sensitive match, which is required here. A binary comparison matches on each character's byte (ASCII) code instead of its collated value, so it is an exact match. See the sketch after this list.
- As `release` is a reserved word in MySQL, we have to write `songs.release` to escape it. Note that we may use backquotes around it when we create the table, but `tableName.columnName` in a query.
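A minimal sketch of the difference, using the hypothetical `songs` table from above and placeholder patterns (the actual Q10 pattern is not reproduced in these notes):

```
-- Case-insensitive under the default collation: matches 'love', 'Love', 'LOVE', ...
SELECT title FROM songs WHERE title LIKE '%love%';

-- Case-sensitive, byte-wise comparison: matches only the exact lowercase 'love'.
SELECT title FROM songs WHERE title LIKE BINARY '%love%';

-- The reserved word release is qualified as songs.release in the query.
SELECT songs.release FROM songs WHERE songs.release LIKE BINARY '%Live%';
```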
Q11
At the end of the query, we add `AS filtered`; otherwise we get the error `Every derived table must have its own alias` (see the sketch below).
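A minimal sketch of a derived table with the required alias, again against the hypothetical `songs` table (the inner query is arbitrary):

```
-- The subquery in FROM is a derived table, so MySQL requires an alias (AS filtered).
SELECT COUNT(*)
FROM (
    SELECT artist_name, AVG(duration) AS avg_duration
    FROM songs
    GROUP BY artist_name
    HAVING avg_duration > 300
) AS filtered;
```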
Bonus: Load data
Log in with the `--local-infile` flag and then execute a `LOAD DATA LOCAL INFILE` statement (a sketch follows).
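A minimal sketch, assuming the hypothetical `songs` table above; the path and delimiters are illustrative and would need to match the actual CSV:

```
# Enable LOCAL INFILE on the client (the server-side local_infile variable may
# also need to be ON), then run the load statement.
mysql -u newuser -pteam --local-infile teamproj <<'SQL'
LOAD DATA LOCAL INFILE '/home/ubuntu/P3_1/million_songs_metadata.csv'
INTO TABLE songs
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
SQL
```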
Part 3. Disk performance test
We benchmark FileIO performance under four settings: t1.micro and m3.large instances, each with a magnetic disk and with an SSD.
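The benchmark tool is not named in these notes; as one possible way to run such a test, here is a minimal sketch using `dd` to measure sequential throughput on each instance/disk combination:

```
# Sequential write throughput, bypassing the page cache (1 GiB test file).
dd if=/dev/zero of=testfile bs=1M count=1024 oflag=direct

# Drop caches, then measure sequential read throughput.
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
dd if=testfile of=/dev/null bs=1M iflag=direct
rm testfile
```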
Part 4. HBase
First, we launch an EMR cluster with one master and one core (slave) node, and SSH into the master node. You should be able to see the hard drive condition through the two commands shown after the setup steps below.
(PS: Don't mistakenly SSH into the slave. You can find the instance ID under EMR.)
- Launch an EMR cluster with 1 master and 1 core node.
- Select “Go to advanced options” at the top of the Create Cluster page.
- Make sure all instances are m1.large.
- Make sure that the EMR cluster is created within the same VPC as your project instance (the one with your runner.sh).
- Choose AMI version 3.11.0 (Hadoop version 2).
- Remove any existing services such as Pig and Hive and choose to install HBase version 0.94.
- You must specify a key-pair to be able to SSH to the individual instances. Note that the username to use while SSHing to these instances is hadoop.
- Do not forget to set a tag that will be automatically propagated to the individual instances in the cluster.
- Enable “termination protection” and “keep-alive” for your cluster.
- Set the security group of both master and core nodes to allow ALL traffic.
- Use the Master public DNS as the SSH DNS.
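With the cluster up, SSH into the master node; a minimal sketch, assuming your key pair is saved locally as `mykey.pem` and `<master-public-dns>` is the Master public DNS from the cluster page:

```
# The username on EMR instances is hadoop (see the key-pair step above).
ssh -i mykey.pem hadoop@<master-public-dns>
```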
After you SSH into the master node, you can run the following command to verify that HDFS is healthy, based on what is reported per datanode:

```
hadoop dfsadmin -report
```

You can check the hard drive condition with these two commands:

```
sudo parted -l
df -h
```

Create a directory on the guest machine (where you are right now) and download the .csv file to that location:

```
mkdir P3_1
cd P3_1
wget https://s3.amazonaws.com/15319-p31/million_songs_metadata.csv
```

Create a directory in HDFS to store the file:
```
hadoop fs -mkdir /csv
```

Put the file into HDFS. Usage of `-put`: `-put /local/directory /hdfs/directory`

```
hadoop fs -put /home/hadoop/P3_1/million_songs_metadata.csv /csv
```

Create a new table in HBase by logging into the shell using:

```
hbase shell
```

Create a new table named songdata with column family data:

```
create 'songdata','data'
```

Exit the shell and prepare the data for import into HBase. You should see a progress bar like `[======>      ]45%` telling you how far the import has gotten:

```
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns="HBASE_ROW_KEY,data:title,data:song_id,data:release,data:artist_id,data:artist_mbid,data:artist_name,data:duration,data:artist_familiarity,data:artist_hotttnesss,data:year" -Dimporttsv.separator="," -Dimporttsv.bulk.output=hdfs:///output songdata hdfs:///csv
```

Load the data into the table:
```
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles hdfs:///output songdata
```

Now, scan the table (in the HBase shell) to make sure the data has been loaded successfully; use Ctrl+C to stop scanning:

```
scan 'songdata'
```

Note down the private IP and enter it into the Java file, then run the demo with:

```
javac HbaseTasks.java
java HbaseTasks demo
```
If you see a result of 96 rows, you can proceed to answer the questions now :)