Providing quick access to ready-to-use Big Data solutions
Because Big Data doesn't have to be complicated
Javier Cacheiro / Cloudera Certified Developer for Spark & Hadoop / @javicacheiro
We don’t have better algorithms.
We just have more data.
Compute centric: bring the data to the computation
Data centric: bring the computation to the data
Jonathan Dursi: HPC is dying, and MPI is killing it
The VPN software allows you not only to connect to CESGA securely but also to access internal resources that you cannot reach otherwise.
Gateway: gateway.cesga.es
Port: 443
Username: your email address registered at CESGA
Password: your password in the supercomputers
unrar e vpn-fortissl.rar
tar xvzf forticlientsslvpn_linux_4.4.2323.tar.gz
cd forticlientsslvpn
./fortisslvpn.sh
Accept the license agreement presented
../forticlientsslvpn_cli \
--server gateway.cesga.es:443 \
--vpnuser usuario@dominio.com
For more info check the VPN section of the User Guide
Using a powerful CLI through SSH:
ssh username@hadoop3.cesga.es
Using a simple Web User Interface
No VPN needed if connecting from a Remote Desktop
Use our dedicated Data Transfer Node: dtn.srv.cesga.es
SCP/SFTP is fine for small transfers
But for large transfers use Globus
Our endpoint is cesga#dtn
For more info check the DTN User Guide
Direct SCP/SFTP to hadoop3.cesga.es
Useful only for internal transfers: e.g. from FT (Finisterrae) to BD (the Big Data cluster)
Not recommended for external transfers because it will be slowed down by the VPN server and the firewall
myquota
Defaults:
If you need additional space you can request Additional Storage
HDFS and HOMEBD do not have automatic backups configured
Only HOME FT has automatic daily backups
If you need to backup data in HDFS or HOMEBD contact us
We recommend that you use the distcp tool
hadoop distcp -i -pat -update hdfs://10.121.13.19:8020/user/uscfajlc/wcresult hdfs://nameservice1/user/uscfajlc/wcresult
Run it from hadoop3.cesga.es so it takes into account HA
To upload a file from local disk to HDFS:
hdfs dfs -put file.txt file.txt
It will copy the file to /user/username/file.txt in HDFS.
To list files in HDFS:
hdfs dfs -ls
Lists the files in your HDFS HOME directory (/user/username/)
To list files in the root directory:
hdfs dfs -ls /
Create a directory:
hdfs dfs -mkdir /tmp/test
Delete a directory:
hdfs dfs -rm -r -f /tmp/test
Read a file:
hdfs dfs -cat file.txt
Download a file from HDFS to local disk:
hdfs dfs -get file.txt
You can easily access the HUE File Explorer from the WebUI:
yarn jar application.jar DriverClass input output
yarn application -list
yarn top
yarn logs -applicationId applicationId
yarn application -kill applicationId
You can access the HUE Job Browser from the WebUI:
Now the main entry point is spark instead of sc and sqlContext
pyspark
module load anaconda2
PYSPARK_DRIVER_PYTHON=ipython pyspark
from pyspark.sql import Row
Person = Row('name', 'surname')
data = []
data.append(Person('Joe', 'MacMillan'))
data.append(Person('Gordon', 'Clark'))
data.append(Person('Cameron', 'Howe'))
df = spark.createDataFrame(data)
df.show()
+-------+---------+
| name| surname|
+-------+---------+
| Joe|MacMillan|
| Gordon| Clark|
|Cameron| Howe|
+-------+---------+
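Since spark also covers what sqlContext used to do, the same session object runs both DataFrame transformations and SQL queries. A minimal sketch continuing the example above (the view name 'people' is just illustrative):

# Filter and select with the DataFrame API
df.filter(df.surname == 'Howe').select('name').show()
# Register the DataFrame as a temporary view and query it with SQL
df.createOrReplaceTempView('people')
spark.sql('SELECT surname, count(*) AS total FROM people GROUP BY surname').show()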
# client mode
spark-submit --master yarn \
--name testWC test.py input output
# cluster mode
spark-submit --master yarn --deploy-mode cluster \
--name testWC test.py input output
# client mode
spark-submit --master yarn --name testWC \
--class es.cesga.hadoop.Test test.jar \
input output
# cluster mode
spark-submit --master yarn --deploy-mode cluster \
--name testWC \
--class es.cesga.hadoop.Test test.jar \
input output
--num-executors NUM Number of executors to launch (Default: 2)
--executor-cores NUM Number of cores per executor. (Default: 1)
--driver-cores NUM Number of cores for driver (cluster mode)
--executor-memory MEM Memory per executor (Default: 1G)
--queue QUEUE_NAME The YARN queue to submit to (Default: "default")
--proxy-user NAME User to impersonate
The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.
start_jupyter
You can also use Jupyter with Python 3:
module load anaconda3/2018.12
start_jupyter
Just load the desired python version first
Keep in mind that CDH 6.1 does not officially support Python 3 yet
You can also try the new Jupyter Lab:
module load anaconda2/2018.12
start_jupyter-lab
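If you want to use Spark from a notebook, and assuming pyspark is importable in the notebook kernel (this depends on how your environment is set up), a minimal sketch would be:

# Create (or reuse) a Spark session on YARN from the notebook
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('notebook-test').master('yarn').getOrCreate()
spark.range(10).show()
spark.stop()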
Hive offers the possibility to use Hadoop through a SQL-like interface
You can use Hive from the WebUI through HUE:
beeline
beeline> !connect
jdbc:hive2://c14-19.bd.cluster.cesga.es:10000/default;
ssl=true;sslTrustStore=/opt/cesga/cdh61/hiveserver2.jks;
trustStorePassword=notsecret
The Hive CLI is now deprecated and not recommended:
hive
-- Illustrative example: the table name and columns are assumptions
CREATE TABLE passwd (username STRING, uid INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ':';
create database if not exists uscfajlc;
use uscfajlc;
hdfs dfs -chmod go-rwx /user/hive/warehouse/uscfajlc.db
impala-shell --ssl --impalad=c14-2
Point it to any worker node in the cluster
Hive and Impala use the same SQL syntax (HiveQL)
sqoop list-tables \
--username ${USER} -P \
--connect jdbc:postgresql://${SERVER}/${DB}
sqoop import \
--username ${USER} --password ${PASSWORD} \
--connect jdbc:postgresql://${SERVER}/${DB} \
--table mytable \
--target-dir /user/username/mytable \
--num-mappers 1
sqoop import \
--username ${USER} --password ${PASSWORD} \
--connect jdbc:postgresql://${SERVER}/${DB} \
--table mytable \
--target-dir /user/username/mytable \
--num-mappers 1 \
--hive-import
sqoop create-hive-table \
--username ${USER} --password ${PASSWORD} \
--connect jdbc:postgresql://${SERVER}/${DB} \
--table mytable
First create the table in PostgreSQL
sqoop export \
--username ${USER} --password ${PASSWORD} \
--connect jdbc:postgresql://${SERVER}/${DB} \
--table mytable \
--export-dir /user/username/mytable \
--input-fields-terminated-by '\001' \
--num-mappers 1
For MySQL and PostgreSQL you can use direct mode (--direct option) for faster performance
> library(sparklyr)
> sc <- spark_connect(master = "yarn-client", spark_home = Sys.getenv('SPARK_HOME'))
The sc object holds the connection to use Spark from R
> spark_disconnect(sc)
start_jupyter
spark-submit --class sparklyr.Shell \
'/opt/cesga/anaconda/Anaconda2-2018.12-sparklyr/lib/R/library/sparklyr/java/sparklyr-2.4-2.11.jar' \
8880 1234 --batch example_sparklyr_script.R
gatk is available as a module on hadoop3.cesga.es
module load gatk
More info on how to use modules on the modules tutorial
gatk uses HDFS when you launch a Spark tool; check the HDFS tutorial for instructions.
gatk only supports Spark for some tools; Spark-enabled tools always end with "Spark". Check the tool list here.
gatk ToolName toolArguments -- --spark-runner SPARK --spark-master yarn additionalSparkArguments
additionalSparkArguments can be used if a gatk job needs access to more resources; see the Spark tutorial.
gatk HaplotypeCallerSpark -L 1:1000000-2000000 \
-R hdfs://nameservice1/user/username/ref.fa \
-I hdfs://nameservice1/user/username/input.bam \
-O hdfs://nameservice1/user/username/output.vcf \
-- --spark-runner SPARK --spark-master yarn --driver-memory 4g --executor-memory 4g
gatk CalcMetadataSpark \
-I hdfs://nameservice1/user/username/input.bam \
-O hdfs://nameservice1/user/username/statistics.txt \
-- --spark-runner SPARK --spark-master yarn
You can use the standard (non-Spark) gatk tools on hadoop3.cesga.es for testing, but it is not recommended.
For this use case, gatk is also available on Finisterrae for faster execution.
The PaaS service is intended to test, learn and develop using the products offered, not for continuous use.
Clusters must be destroyed when not needed anymore.
Clusters are designed for flexibility; if you need to use any of the products permanently in a production environment, contact us for a custom solution.
Access the WebUI and log in.
In the menu on the left click on PaaS and then on Products to see the available products
Select a product and click on LAUNCH
Specify the resources that you need
Here you can get information on the clusters you launched
Destroy the clusters when they are no longer needed
All PaaS products should only be used for testing and development purposes.
Destroy the clusters when you no longer need them; do not saturate the service.
Add the HBase node to your /etc/hosts file so its name resolves, e.g.:
10.121.243.250 hbase_node0
Then check connectivity:
ping hbase_node0
You can use happybase. There is a 'test' table you can use to verify everything works:
import happybase
connection = happybase.Connection(host='10.121.243.250')
test = connection.table('test')
for k, d in test.scan(): print k, d
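To write your own data with happybase, a sketch like the following would work; the table name 'mytable' and the column family 'cf' are assumptions, so adapt them to a table that exists in your instance:

# Assumes a table 'mytable' with column family 'cf' already exists
table = connection.table('mytable')
table.put(b'row1', {b'cf:greeting': b'hello'})
print(table.row(b'row1'))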
To connect using Java you can use our sample Java HBase client and modify hbase.zookeeper.quorum to reflect the right IP of your instance
The example app can be built using Maven:
mvn package
java -jar target/hbaseclient-0.1.0.jar
Activate your VPN and connect to the webUI.
Go to PaaS > Products and launch a MongoDB cluster.
Select the resources you need.
For testing the product, use only 1 CPU to save resources
When the cluster status is ready, click on Show Info and copy the IPs provided.
From your computer:
mongo --host 10.112.243.28
From hadoop3.cesga.es or ft.cesga.es:
mongo --host 10.117.243.28
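You can also connect from Python with pymongo (assuming the pymongo package is installed; the IP is the internal one shown above, and the database and collection names are illustrative):

from pymongo import MongoClient
client = MongoClient('10.117.243.28', 27017)   # default MongoDB port
db = client['test']
db.demo.insert_one({'status': 'ok'})           # 'demo' collection is illustrative
print(db.demo.find_one())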
Activate your VPN and connect to the webUI.
Go to PaaS > Products and launch a Cassandra cluster.
Select the resources you need: a minimum of 3 nodes (an odd number is recommended) and 2 disks per node (maximum 11 disks)
For testing the product, use the minimum resources possible
When the cluster status is ready, click on Show Info and copy the IPs provided for "cassandranode0"
You can access it through cqlsh or from any other application, e.g. Python (see the sketch after the cqlsh examples below)
From your computer:
cqlsh 10.112.243.28
From hadoop3.cesga.es or ft.cesga.es:
cqlsh 10.117.243.28
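For example, from Python with the DataStax cassandra-driver package (assuming it is installed; the contact point is the cassandranode0 IP shown in the cluster info):

from cassandra.cluster import Cluster
cluster = Cluster(['10.117.243.28'])   # cassandranode0
session = cluster.connect()
row = session.execute('SELECT release_version FROM system.local').one()
print(row.release_version)
cluster.shutdown()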
Activate your VPN and connect to the webUI.
Go to PaaS > Products and launch a MariaDB cluster.
Select the resources you need and provide a numeric password.
For testing the product, use only 1 CPU to save resources
When the cluster status is ready, click on Show Info and copy the IPs provided.
From your computer:
mysql -u root -p -h 10.112.243.28
From hadoop3.cesga.es or ft.cesga.es:
mysql -u root -p -h 10.117.243.28
Activate your VPN and connect to the webUI.
Go to PaaS > Products and launch a PostgreSQL cluster, selecting your preferred version.
Select the resources you need and provide a numeric password.
For testing the product, use only 1 CPU to save resources
When the cluster status is ready, click on Show Info and copy the IPs provided.
You need to connect as the user postgresql to the database test
You will then be asked for your numeric password
After initial connection you can create other databases using create database <databasename>
From your computer:
psql -h 10.112.243.248 -U postgresql test
From hadoop3.cesga.es or ft.cesga.es:
psql -h 10.117.243.248 -U postgresql test
This product includes 2 of the most popular extensions for managing time series and spatial data
After launching the product, the process of setting up and connecting is the same
In order to use the extensions, load them first in your database
CREATE EXTENSION IF NOT EXISTS postgis CASCADE;
CREATE EXTENSION IF NOT EXISTS timescaledb VERSION '1.7.5' CASCADE;
module available
We recommend that you use the Anaconda version of Python instead of the OS one
module load anaconda2/2018.12
You can even try Python 3 (not officially supported)
module load anaconda3/2018.12
You can load the following build tools
Cloudera Official Documentation:
Reference Documentation for each component:
We have prepared a documentation pack including today's slides as well as all related material:
Big Data is a multi-disciplinary domain.
Collaboration is a key factor in Big Data projects.
Tell us your project and we will help you:
MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
To launch a job:
yarn jar job.jar DriverClass input output
To list running MR jobs:
mapred job -list
To cancel a job:
mapred job -kill [jobid]
You can easily monitor your jobs using the YARN UI from the WebUI:
You can see finished jobs using the MR2 UI from the WebUI:
git clone https://github.com/bigdatacesga/mr-wordcount
# Download sources and javadoc
mvn dependency:sources
mvn dependency:resolve -Dclassifier=javadoc
# Update the existing Eclipse project
mvn eclipse:eclipse
# Or if you are using IntelliJ IDEA
mvn idea:idea
Compile:
mvn compile
Run the tests:
mvn test
Package your app:
mvn package
If you prefer to compile and package manually:
javac -classpath $(hadoop classpath) *.java
jar cvf wordcount.jar *.class
Basic components of a program:
public class Driver {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(Driver.class);
job.setJobName("Word Count");
job.setMapperClass(WordMapper.class);
job.setCombinerClass(SumReducer.class);
job.setReducerClass(SumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
boolean success = job.waitForCompletion(true);
System.exit(success ? 0 : 1);
}
}
public class WordMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
for (String field : line.split("\\W+")) {
if (field.length() > 0) {
word.set(field);
context.write(word, one);
}
}
}
}
public class SumReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(
Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int wordCount = 0;
for (IntWritable value : values) {
wordCount += value.get();
}
context.write(key, new IntWritable(wordCount));
}
}
yarn jar \
/usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
-input input -output output \
-mapper mapper.pl -reducer reducer.pl \
-file mapper.pl -file reducer.pl
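The mapper.pl and reducer.pl scripts themselves are not shown here. As an illustration, an equivalent word-count pair could be written in Python (hypothetical mapper.py and reducer.py; streaming scripts read records from stdin and emit key<TAB>value lines on stdout):

#!/usr/bin/env python
# mapper.py: emit "word<TAB>1" for every word read from stdin
import sys
for line in sys.stdin:
    for word in line.split():
        print('%s\t%d' % (word, 1))

#!/usr/bin/env python
# reducer.py: sum the counts for each word (input arrives sorted by key)
import sys
current, total = None, 0
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    word, count = line.split('\t', 1)
    if word != current:
        if current is not None:
            print('%s\t%d' % (current, total))
        current, total = word, 0
    total += int(count)
if current is not None:
    print('%s\t%d' % (current, total))

The streaming command above would then use -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py instead.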
BLAS: low-level routines for performing common linear algebra operations
NumPy: adds support to Python for fast operations with multi-dimensional arrays and matrices
Already configured to use Intel MKL
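A quick sketch to verify the MKL-backed build and exercise the array routines (after module load anaconda2; the matrix sizes are arbitrary):

import numpy as np
np.show_config()                  # the BLAS/LAPACK sections should mention MKL
a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)
c = np.dot(a, b)                  # matrix product dispatched to the MKL BLAS
print(c.shape)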