Hadoop Lessons

List GCS files in 3 steps using Python program

1) We should install google-cloud-storage package to access Google Cloud storage in our Python program.

Just run the following pip install command to install google-cloud-storage

pip install google-cloud-storage

2) The first thing to do to access GCS from your Python program is authentication.

We'll create a credentials JSON file and use it for authentication.

You can generate your own credentials JSON file using this link

Generate credentials JSON file

You can also follow this link to generate JSON file.

http://www.hadooplessons.info/2023/03/simple-way-to-generate-json-key-for.html

Here are the crucial steps to write a Python program for accessing GCS:

create a client

client = storage.Client()

get access to the existing bucket in GCS

bucket = client.get_bucket(bucket_name)

create file in the gcs bucket and write data into it.

blob = bucket.blob(file_name)

blob.upload_from_string(text)

The full code is given below.

from google.cloud import storage

import os

#from google.oauth2 import service_account

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "application_default_credentials.json"

#credentials=service_account.Credentials.from_service_account_file('./application_default_credentials.json')

def list_gcs_blobs(project_id,bucket_id):

client = storage.Client(project=project_id)

blobs=client.list_blobs(bucket_id)

return blobs

project_id='startbig'

bucket_id='pubtest1'

blobs=list_gcs_blobs(project_id,bucket_id=bucket_id)

for blob in blobs:

print('blob name is',blob)

Modify project_id,bucket_id and run the program. Please note you need to keep credentials JSON file in the same directory this program is written otherwise change the path in os.environ line.

Top 10 key differences between Java and Python

Top two programming languages used in the software industry are Python and Java.Python tops the list and Java is close to Python.

Are you curious about questions such as which programming language, Java or Python, is faster?

Wondering which language is superior for web development or overall usage? This article delves into the 10 crucial distinctions that every Java and Python developer should be aware of.

Java is a compiled language. Entire Java code is converted into bytecode before it is run on a machine. This makes Java faster.

Whereas Python is an interpreted language, Python translates one line after another while is being run on the machine. this makes Python slower

Java is a static-typed language. Every variable requires a datatype to be defined in Java and every method should have a return type.

Unlike Java, Python is a dynamically typed language. Python does not need a datatype for a variable and does not need a return type for a function.

Java supports multiple inheritance partially through Interfaces. Python supports multiple inheritance fully. Python addresses the diamond problem using method resolution order.

Java has several options for web development. Many frameworks like Spring support web development in Java.

Python has fewer options for web development. Tools like FLASK and Django provide web development in Python.

Python has advanced tools like Pandas and numpy for data analysis. data analysis tools in Java are not as popular as pandas and numpy.

Java is the best option for database integration. Java has many ORM tools like hibernate and also Java supports enterprise applications.

Python has ORM tools like SQLAlchemy not many enterprise applications use Python.

As discussed, Java is a static-typed language. Rapid prototyping and rapid development is difficult in Java. In Python prototyping and development tasks are faster.

Java is fast and Python is slow. Python is an interpreted language and it translates code line after line. Python is slow because of that.

Java is relatively less known for Artificial intelligence and Machine Learning applications. Python community has a rich library for artificial intelligence and machine learning applications.

10.

Java is well for mobile development. Whereas Python is not well-known for mobile development.

These are the top 10 differences between Java and Python. Let me know if you know of any more differences.

Simple way to generate JSON key for google service account

Cloud technologies became one of the important skill sets in the job market nowadays. It could be Google Cloud, Microsoft Azure, or Amazon cloud technologies. In this article, we will learn the easy way to generate JSON key files for a Google service account.

1. Go to service accounts under IAM & Admin menu

After opening Google Cloud Console, Click on the 3 horizontal lines next to Google Cloud in the top left corner, Later click on the IAM and Admin menu from there go to service accounts and click on it.

Select relevant Google cloud project if already not done.

locate service accounts options under Iam and admin menu in the google cloud

Locate service accounts under IAM & Admin

2. Create a new service account

From here, We have two options for generating JSON files for service accounts. We can generate JSON files for existing service accounts or We can create a service account and generate JSON key file for that. We will see both options in this article. First We will create a service account and then go to JSON key file generation.

in the middle of the page, We will see Create service account option, click on that.

Create a service account under IAM and Admin in the google cloud

3. Enter Service account details

Enter a name for the service account, and the service ID is auto-generated. You can enter some text as a description for the service account.

Service account name, ID, and Description in the google cloud

for example, This service account is used for Airflow.

4. Choose Roles for the Service account

In the second step, you can also select some roles for the gcloud service account. for example, we have selected a browser role for our service account dummy.

Now click on done, and a service account called dummy is created for you. We have several other options while creating a service account, we will cover them in another article.

5. JSON key generation

Now Go to the service account you just created and click on the keys menu and also click on the add key option.

Keys for the service accountin google cloud

From there choose to create a new key option.

JSON option is auto-selected and we just need to click on the create option.

JSON file is created and auto-downloaded to the local downloads folder.

Json key generation for a service account in the Google Cloud

This JSON file is used in several places for authentication. for example, you can use this JSON file in Tableau software to authenticate and also we can use this JSON file in the Python program to authenticate to google cloud services.

Let me know if you have any queries. You can also check this video for the same help.

Generate JSON file in GCP

The Art of social media: Power tips for Power users Book review

For individuals and companies, It is impossible to build a brand without using social media platforms nowadays. I recently read The art of social media power tips for power users book is written by Guy Kawasaki and Peg FitzPatrick.

The Art of social media: Power tips for power users

The first reason I choose this book is the rich experience of Guy Kawasaki and Peg Fitzpatrick in social media management. He is a chief evangelist at Canva designing company, and He was a chief evangelist at Apple and Motorola. Peg is a Social media strategist and director at Kruessler inc. She had successfully delivered many social media campaigns for Google, Motorola, Audi, and Canva.

This book talks about how to lay a strong social media foundation both for an individual and company. This book lists around 120 easy to read steps about social media platforms like Facebook, Google+, Twitter, Instagram, LinkedIn, YouTube, SlideShare, and Pinterest.

All these 120 steps categories into how to write a great profile, how to write a great content, How to write a great post, How to manage comments and How to integrate social media and Blogging, How to run live programs, How to run google hangouts, How to avoid overconfidence and how to optimize for individual platforms.

In the end, This book takes a practical case study (How to launch a book) and explains How to use all these tips and tricks. This Book also lists tools and websites wherever necessary. But most of the tools like Likealyzer, PostPlanner, and Klout had stopped their service as of now.

This is one of the disappointments for me regarding this book.

Some important points I learned from this book.

Create a strong profile on every social media platform that should give a great impression for the opposite person under 5 seconds.
Plan your content and schedule it using tools like SproutSocial, HootSuite, or Buffer.
Content curation is important as much as a content generation.
Make use of Facebook groups, Google+ circles/communities, and Twitter Lists to spread the content and brand on social media.
Try to tell a visual story as much as possible.
Write a post that gives value to the user and avoid writing if you can not add value to the user.
Do not forget to include hashtags and increase discoverability.
Write posts in as many languages as possible
Be positive while replying to comments and do not take anything personally.
If required delete a comment, block, or report a person.
Add a share button to your blog and add social media posts to your blog.
Create a mail list and share your blog articles with your mail list.
Don't call yourself an expert and Don't ask your readers to share your posts.
Use personalized connection request on the LinkedIn
Create a Slideshare presentation and share it on Linkedin.
Start collaborative boards on Pinterest
Create a visual story on Twitter
Use popular hashtags on Instagram
Interact with other pages as much as possible.
Integrate your Facebook page with the Instagram page.

The book contains 120 highly practical tips and tricks mentioned above.

Few missing things in this book are like below.

Most of the time, This book talks only about a strong social media foundation but does not talk about growth activities/tips for the social media brand.
This book does not talk about the strategy required for creating hashtags on social media platforms.
Most of the tools like LikeAlyzer, Post planner and Klout mentioned in this book are not present as of now.

Overall This book is highly recommended for building great social media brands.

Adding new queue to capacity scheduler

1) Click on Active RM UI

2) In the RM UI , click scheduler

Confirm we have a single queue called default

3) Open queue manager view from Ambari.

and confirm you have a single queue called default

4) Add queue

5) Modify the default queue

6) Modify new queue

7) Save and refresh queue

8) Confirm new queue in ambari queue manager

9) Confirm new queue in RM UI

How to clear Hortonworks administrator (HDPCA) certification?

In Hadoop administration, We have three certifications provided by MapR (MCCA), Cloudera (CCA) and Hortonworks (HDPCA).I have cleared Hortonworks administrator certification (HDPCA) and also I helped many of my friends to clear the Hortonworks administrator (HDPCA) exam . In that process , I have learned many things about this exam .

I would like to share few tips on how to clear HDPCA exam. May be these tips will help you in clearing the certification.

Exam will provide 2 hours of time to resolve practical tasks on a HDP 2.3 cluster. Exam result does not provide any score or detailed explanation why you have failed. Exam result just provides either you passed or failed. It is better to follow few tips mentioned below.

Start early

Try to come online for exam at least 30 minutes before your actual exam time. Hortonworks will send you exam link before 15 minutes , You need to click the link for further instructions about the exam. Once you click on the link , Your examiner will join in webex like environment and will complete pre-checks for the exam. Pre-checks will include checking candidate photo identity card and checking room environment. You need to come online at least 30 minutes before to complete all these formalities.

Cluster

Hortonworks provided cluster could be slow. Use your time wisely. Certification will have GUI base tasks and terminal based tasks. If web UI is taking time to load, Go to terminal to perform another task.

Time

Due to slow cluster , You may not be able to complete all tasks in time. Talk to your examiner about cluster slowness, He could give you some more time for completing your other tasks.

Verify answers

If you get fast cluster, You can even complete the tasks much before your total exam time . It is good idea to revisit tasks to verify how did you resolve your tasks.

Read tasks carefully

It is very important to read tasks carefully. Every task will complete information on what is expected from you. For example task will mention you that you need to install node manager on the host master1. if you do not read tasks carefully , you might install node manager on the host master2.

Go to master node first

When your exam starts , you will given gateway node access. You will also be given all credentials for many hosts in the cluster. Do not forget to ssh to master node of the cluster . Even though it is known thing for many people , some times we forget about that. at least I forgot to ssh to first master node and was trying to resolve some tasks on gateway node itself.

Try mock exam on AWS

Mock exam on AWS is almost same as original exam. It is recommended to try mock exam before taking actual exam.Mock exam will give you real exam experience, Mock exam can also give you confidence to attend real exam.

The following is the syllabus for HDPCA exam along with tutorials.

Installation

Configure a local HDP repository

Install ambari-server and ambari-agent
Ambari server installation
Ambari agent installation

Install HDP using the Ambari install wizard

Add a new node to an existing cluster

Decommission a node
Decommissioning of node managers
Decommissioning of data nodes

Add an HDP service to a cluster using Ambari

Configuration

Define and deploy a rack topology script

Change the configuration of a service using Ambari

Modifying configurations to enable or disable ACLs

Configure the Capacity Scheduler

Create a home directory for a user and configure permissions

Configure the include and exclude DataNode files

Decommissioning of node managers
Decommissioning of data nodes

Troubleshooting

Restart an HDP service

View an application’s log file

Configure and manage alerts

Troubleshoot a failed job

Checking versions of Hadoop eco system tools

High Availability

Configure NameNode HA

Configure ResourceManager HA

Copy data between two clusters using distcp

Create a snapshot of an HDFS directory

Recover a snapshot

Configure HiveServer2 HA

Security

Install and configure Knox

Introduction to Knox
Installing and configuring Knox
WebHDFS REST API with knox and without knox

Install and configure Ranger

Install and configure Ranger
Creating HDFS policies in Ranger

Configure HDFS ACLs

Importing data from RDBMS to Hadoop using Apache Sqoop

In this article, We will learn how to import data from RDBMS to HDFS using Apache Sqoop. We will import data from Postgres to HDFS in this article.

1) Check postgres is running.

We will use the service command to check the status of Postgres. The command below checks the status of Postgres.

service PostgreSQL status

2) List databases;

We will connect to Postgres database and check the list of databases available in postgres. We can use psql command to start Postgres prompt and we can use \list command to display list of databases in postgres.

3) List tables in a database.

In this step, We will connect to one database and check tables in that dabatase. \c command will allow us to connect to a database in postgres.

We can use \dt command to display tables in current databases.

4) Check the data in the table.

We will select one table and check the data in that table before importing. We will import the data from table called TBLS in hive database, We will check the data in TBLS table using below select query .

select * from public."TBLS";

5) Importing data into HDFS using sqoop import command.

We use sqoop import command to transfer the data from RDBMS to postgres. We need to use below sqoop import options to import the data .

--connect : This option takes JDBC connection string of an RDBMS .

Syntax : jdbc:://:/

RDBMS-name : We need to specify RDBMS name here. If it is oracle , we specify oracle and if it is postgres , we will specify postgresql here.

database-hostname : We need to specify hostname or ip address where RDBMS is running.

dabatase-port : We need to use port number on which rdbms system is listening. Postgres's default port number is 5432.

database-name : We need to specify database name from RDBMS system from which we want to import the data.

--table : Takes table name from which we want to import the data .

--username : We need to specify user name of database .

--password : We need to specify password of database.

--target-dir : We need to specify HDFS directory where we want to import the data from RDBMS system.

--m : specifies number of parallel copies used to transfer the data from RDBMS system to HDFS.

We will run below command to tranfer the data from postgres table called TBLS to HDFS directory /user/hdfs/sqoopout/tbls .

sqoop import --connect jdbc:postgresql://192.168.1.113:5432/hive --table TBLS --username postgres --password postgrs --target-dir /user/hdfs/sqoopout/tbls -m 1

6) Check data in HDFS folder.

Now we will check the data in HDFS directory to ensure data is transferred successfully.

7) Output directory already exists error.

Directory path mentioned in --target-dir option will be created by sqoop import command. If directory already exists in HDFS, sqoop import command would fail with Output directory already exists error.

8)Delete directory if already exists in HDFS

We can use --delete-target-dir option to import data into HDFS directory even if directory exists in the HDFS. This --delete-target-dir option will remove existing directory and will create the diretcory again. We need to be extra careful while using this option as existing data will be removed.

Apache sqoop simplifies bi-directional data transfer between RDBMS systems and Apache Hadoop.

Let me know if you need any help on above commands.

Enabling namenode HA using Apache ambari

In this article we will learn how to enable high availability for name node. Nam node high availability has more than one name node. One of the name nodes will be active and it will responsible for serving user requests. Other namenodes will be in stand by mode, Standby namenodes will read meta data of active namenode continuously to be in sync with active namenode . If active namenode goes down , One of standby namenodes will become active to serve user requests that to without failing running jobs.

1) Confirm no HA enabled for Name node

By default, Hortonworks Data Platform setup will include namenode and Secondary namenode in HDFS service. In this scenario if namenode goes down , entire cluster will down and running jobs would fail. To address these issues ,We need to enable High availability (HA) for Name node.

Namenode HA will fail over to other name nodes automatically to avoid cluster down scenarios .

The picture below confirms we have namenode and Secondary namenode in the cluster.

We need to enable Namenode HA to have active namenode and standby namenode.

2) Click enable Namenode HA under service actions

Click Enable Namenode HA to enable HA for name node. This will open namenode HA wizard.

Go to HDFS -----> click Service Actions -------> click Enable Namenode HA

3) Getting started

We need to enter nameservice ID in the first step. Nameservice ID will resolve to active namenode automatically.

All hadoop clients should use nameservice ID rather than hard coding active namenode.

4) Select Hosts

Namenode HA wizard will install

additional namenode
3 journal nodes
2 Zookeeper failover controllers

In this step we need to select hosts for additional namenode and journal nodes.

5) Review

This step provides complete information about what wizard is going to install and what configurations wizard is going to add/modify.

We can go back and modify things at this step if we want. Click Next to go to next step.

6) Create checkpoint

In this step , Wizard asks us to perform two things.

entering namenode into safemode.
creating checkpoint for namenode.

**** Please note we need to run given commands only on specified node.

Once these commands are run successful, Next button will be enabled.

7) Configure components

This step performs

Stopping all services
Installing additional namenode on specified host
Installing journal nodes on specified hosts.
Modifying configurations with required properties for Namdenode HA.
Starting journal nodes
Disabling secondary namenode

Click next once all these operations are completed.

8) Initialize journal nodes

This step asks us to run initializeSharedEdits command on first master node.

Once command is run on specified node , click Next

9) Start components

This step performs two things.

Starts zookeeper servers
Starts Namenode

Click Next once two operations completed.

10) Initialize metadata

This step asks us to run two commands on two master nodes.

We need to run formatZK command on first master.
We need to run bootstrapStandby command on second master.

**** Please note we need to run given commands only on specified nodes.

11) Finalize HA setup

This step performs below things.

Starts additional namnode on specified node.
Install Zookeeper Failover controllers on two master nodes.
Starts Zookeeper Failover controllers on two master nodes.
Configures AMS
Deletes secondary namendoe as it is not required in namenode HA.
Stops HDFS
Starts all services

Click Done once above operations are completed.

12 ) Confirm name node HA is enabled

Apache Ambari reloads automatically after enabling HA for namenode, displays active namenode , standby namenode, journal nodes and Zookeeper failover controllers.

The picture below shows all of them.

Let me know if you are struck anywhere while enabling HA for namenode.

Enabling Resource manager HA using Ambari

In this article , We will learn how to enable Resource manager (RM) High availability (HA) using Apache Ambari. In resource manager high availability, the Hadoop cluster will have two or more resource managers. One resource manager will be active and other resource managers will be standby.

The active resource manager is responsible for serving user requests. if the active resource manager goes down, one of the Standby resource managers will become an active resource manager and will serve the user requests without failing running jobs.

By default HDP comes with a single resource manager without HA, We need to install one more resource manager and modify/add properties to enable Resource manager HA.

1) Confirm No HA is enabled

Go to Ambari home page ---> click on YARN --->Summary

If no HA is enabled for resource manager, You will see only one resource manager as shown in the below picture.

2) Click enable RM HA

Click on enable Resource Manager HA under service actions to initiate enabling RM HA.

Go to Ambari home page ------->click YARN-------> click Service actions -------> click enable Resource Manager HA

The picture below shows enable Resource Manager HA option.

3) Getting started

Ambari opens a new HA wizard that will walk us through enabling RM HA. We need downtime for cluster to enable HA. 1 hour down time is recommended .depending on the cluster size, You can plan for more down time.

This is information step , read and click next.

4) Select Host

We need to have one more resource manager for HA. We will select a node to install one more resource manager in this step. Click next once a node is selected.

In the picture below , I have selected node master2 to install one more resource manager.

5) Review

This is a review step. We can go back to previous step if we want to modify anything. The picture below show additional resource manager is going to be installed on master2 node.

We can go back and modify if we want to modify that node to something else. Else just click Next.

Some new properties need to be added/modified to YARN to enable resource manager HA. This step will show you all properties to be modified/added.

6) Configure components

This step performs 5 operations .

Amabri stops all required services to enable HA.

Installs new resource manager on selected node.

Adds/modifies YARN configuration for HA

Adds/modifies HDFS configurations for RM HA

Start all services.

You can click on operations to see logs.

Once all operations are completed, Click Next.

7) Confirm active and standby

Amabri reloads after completing RM HA , We can see two resource managers. One resource manager will be active and other will be standby.

The picture below shows two resource mangers and we can confirm RM HA is enabled.

Let me know if you have any questions about enabling RM HA using Ambari.

Log files in Hadoop eco system

In this article , We will learn how to check log files of hadoop daemons and how to read log files of applications and jobs.

1) Locate log directory in Apache Ambari

First We need to know log directory for hadoop daemons. We can use Apache ambari to find out log directory for a service. default log directory for any service is /var/log/. Many companies may not use default log directory , so it is better to know log directory using either Ambari or Cloudera Manager. We can even use Unix command to find log directories.

The picture below shows log directory for HDFS service.

Click on HDFS ----> Configs -------> type log in filter box.

The picture below shows how to locate log directory for Apache Oozie using grep command of Unix.

2) Types of log files

log directories will have three types of files .

.log

Logs of running daemons will be available here in this .log file.

The picture below shows logs of running active name node . tail -f command is used to see live logs of a daemon.

.out

.out file will have startup messages of a daemon. These messages will be useful to trouble shoot startup failures of a daemon.

The picture below show startup messages of active resource manegr.

.log.[date]

Old log files will have date in their name. by default log rotation is daily, so we will see one log per day.

The picture below shows old ranger log files have date in their name.

3) Command for applications logs

We have seen how to check logs of hadoop daemons, Now we will learn how to check logs of an application.

We need to have application ID to check logs of an application.

The picture below shows how to get logs of an application id application_1513741463894_0007.

Command used : yarn logs -applicationId application_1513741463894_0007

4) Command for job logs.

We can even check the logs of a hadoop job if we have job id.

The picture below shows how to job logs for job_id job_1513741463894_0001.

Command used : mapred job -logs [job_id]

5) Logs from Resource manager UI.

We can even get application logs from resource manager UI. go to Resource manager UI and click on application ID for which you want to check logs.

The picture below shows logs link in Resource manager UI for application application_1513741463894_0007.

Let me know if you have any questions on how to check log files for any service.

Technology

Search This Blog