Project 3.1 Files and Databases
Goal: compare the efficiency of three kinds of queries:
- Query with .csv files using `grep` and `awk`
- Query with MySQL
- Query with HBase
Part 1. File query (skipped)
Part 2. MySQL
Configure database
Install MySQL using

```
sudo apt-get install mysql-server
```

Connect as the root user, typing your password right after `-p`:

```
mysql -u root -ppassword
SHOW DATABASES;
```

Create a new user:

```
CREATE USER 'newuser'@'localhost' IDENTIFIED BY 'team';
```

Grant partial or all privileges and log in again; here we grant all privileges:

```
GRANT ALL PRIVILEGES ON teamproj.* TO 'newuser'@'localhost';
```

Create a database and table which support full Unicode:

```
CREATE DATABASE blah;
```
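The statement above only names the database. As a hedged sketch of what a full-Unicode setup could look like, here is the `teamproj` database referenced in the GRANT above together with a hypothetical `songs` table whose columns mirror the CSV fields imported later; the actual schema used in the project is not shown in these notes:

```
-- Hypothetical schema; utf8mb4 is MySQL's full-Unicode character set.
CREATE DATABASE teamproj CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
USE teamproj;
CREATE TABLE songs (
    title               VARCHAR(255),
    song_id             VARCHAR(32),
    `release`           VARCHAR(255),   -- reserved word, so backquoted here
    artist_id           VARCHAR(32),
    artist_mbid         VARCHAR(40),
    artist_name         VARCHAR(255),
    duration            FLOAT,
    artist_familiarity  FLOAT,
    artist_hotttnesss   FLOAT,
    year                INT
) DEFAULT CHARSET=utf8mb4;
```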
Answer questions
Q8 & Q9
Use an index to dramatically decrease query time (see the sketch below).
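A minimal sketch of adding an index, assuming the hypothetical `songs` table above and a query that filters on `artist_name`; the actual Q8/Q9 queries are not reproduced here:

```
-- An index on the filtered column lets MySQL avoid a full table scan.
CREATE INDEX idx_artist_name ON songs (artist_name);

-- EXPLAIN shows whether the index is actually used by the query.
EXPLAIN SELECT title FROM songs WHERE artist_name = 'SomeArtist';
```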
Q10
- Here, we use `LIKE` for a case-insensitive pattern match; use `LIKE BINARY` for a case-sensitive match, which is required here. A binary comparison matches on each character's byte (ASCII) code instead of its collated value, so it is an exact match. See the sketch after this list.
- As `release` is a reserved word in MySQL, we have to write `songs.release` to escape it. Note that we may use backquotes around it when we create the table, but `tableName.columnName` in a query.
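A minimal sketch of the difference, using the hypothetical `songs` table from above and placeholder patterns (the actual Q10 pattern is not reproduced in these notes):

```
-- Case-insensitive under the default collation: matches 'love', 'Love', 'LOVE', ...
SELECT title FROM songs WHERE title LIKE '%love%';

-- Case-sensitive, byte-wise comparison: matches only the exact lowercase 'love'.
SELECT title FROM songs WHERE title LIKE BINARY '%love%';

-- The reserved word release is qualified as songs.release in the query.
SELECT songs.release FROM songs WHERE songs.release LIKE BINARY '%Live%';
```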
Q11
At the end of the query, we add `AS filtered`; otherwise we get the error `Every derived table must have its own alias` (see the sketch below).
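A minimal sketch of a derived table with the required alias, again against the hypothetical `songs` table (the inner query is arbitrary):

```
-- The subquery in FROM is a derived table, so MySQL requires an alias (AS filtered).
SELECT COUNT(*)
FROM (
    SELECT artist_name, AVG(duration) AS avg_duration
    FROM songs
    GROUP BY artist_name
    HAVING avg_duration > 300
) AS filtered;
```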
Bonus: Load data
Log in with the `--local-infile` flag and then execute a `LOAD DATA LOCAL INFILE` statement (a sketch follows).
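A minimal sketch, assuming the hypothetical `songs` table above; the path and delimiters are illustrative and would need to match the actual CSV:

```
# Enable LOCAL INFILE on the client (the server-side local_infile variable may
# also need to be ON), then run the load statement.
mysql -u newuser -pteam --local-infile teamproj <<'SQL'
LOAD DATA LOCAL INFILE '/home/ubuntu/P3_1/million_songs_metadata.csv'
INTO TABLE songs
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
SQL
```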
Part 3. Disk performance test
We benchmark FileIO performance under four settings: t1.micro and m3.large instances, each with a magnetic disk and with an SSD.
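The benchmark tool is not named in these notes; as one possible way to run such a test, here is a minimal sketch using `dd` to measure sequential throughput on each instance/disk combination:

```
# Sequential write throughput, bypassing the page cache (1 GiB test file).
dd if=/dev/zero of=testfile bs=1M count=1024 oflag=direct

# Drop caches, then measure sequential read throughput.
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
dd if=testfile of=/dev/null bs=1M iflag=direct
rm testfile
```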
Part 4. HBase
First, we launch an EMR cluster with one master and one core (slave) node, and SSH into the master node. You should be able to see the hard drive condition through the two commands shown after the setup steps below.
(PS: Don't mistakenly SSH into the slave. You can find the instance ID under EMR.)
- Launch an EMR cluster with 1 master and 1 core node.
- Select “Go to advanced options” at the top of the Create Cluster page.
- Make sure all instances are m1.large.
- Make sure that the EMR cluster is created within the same VPC as your project instance (the one with your runner.sh).
- Choose AMI version 3.11.0 (Hadoop version 2).
- Remove any existing services such as Pig and Hive and choose to install HBase version 0.94.
- You must specify a key-pair to be able to SSH to the individual instances. Note that the username to use while SSHing to these instances is hadoop.
- Do not forget to set a tag that will be automatically propagated to the individual instances in the cluster.
- Enable “termination protection” and “keep-alive” for your cluster.
- Set the security group of both master and core nodes to allow ALL traffic.
- Use the Master public DNS as the SSH DNS.
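With the cluster up, SSH into the master node; a minimal sketch, assuming your key pair is saved locally as `mykey.pem` and `<master-public-dns>` is the Master public DNS from the cluster page:

```
# The username on EMR instances is hadoop (see the key-pair step above).
ssh -i mykey.pem hadoop@<master-public-dns>
```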
After you SSH into the master node, you can run the following command to verify that HDFS is healthy, based on what is reported per datanode:

```
hadoop dfsadmin -report
```

You can check the hard drive condition with these two commands:

```
sudo parted -l
df -h
```

Create a directory on the guest machine (where you are right now) and download the .csv file to that location:

```
mkdir P3_1
cd P3_1
wget https://s3.amazonaws.com/15319-p31/million_songs_metadata.csv
```

Create a directory in HDFS to store the file:
```
hadoop fs -mkdir /csv
```

Put the file into HDFS. Usage of `-put`: `-put /local/directory /hdfs/directory`

```
hadoop fs -put /home/hadoop/P3_1/million_songs_metadata.csv /csv
```

Create a new table in HBase by logging into the shell using:

```
hbase shell
```

Create a new table named songdata with column family data:

```
create 'songdata','data'
```

Exit the shell and prepare the data for import into HBase. You should see a progress bar like `[======>      ]45%` telling you how far the import has gotten:

```
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns="HBASE_ROW_KEY,data:title,data:song_id,data:release,data:artist_id,data:artist_mbid,data:artist_name,data:duration,data:artist_familiarity,data:artist_hotttnesss,data:year" -Dimporttsv.separator="," -Dimporttsv.bulk.output=hdfs:///output songdata hdfs:///csv
```

Load the data into the table:
```
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles hdfs:///output songdata
```

Now, scan the table (in the HBase shell) to make sure the data has been loaded successfully; use Ctrl+C to stop scanning:

```
scan 'songdata'
```

Note down the private IP and enter it into the Java file, then run the demo with:

```
javac HbaseTasks.java
java HbaseTasks demo
```
If you see a result of 96 rows, you can proceed to answer the questions now :)