Apache NiFi Interview Questions


In this article, we look at common Apache NiFi interview questions. They vary in complexity, but all of them are important, and you should know the answers before going to an interview.


Apache NiFi Interview Questions and Answers

1. What is Apache NiFi?

Apache NiFi is an enterprise integration and dataflow automation tool for sending, receiving, routing, transforming and modifying data as needed, all of which can be automated and configured. NiFi can connect multiple information systems and many types of sources and destinations, such as HTTP, FTP, HDFS, the file system and various databases.

2. What is MiNiFi?

MiNiFi is a subproject of Apache NiFi designed as a complementary data collection approach that supplements the core tenets of NiFi, focusing on the collection of data at the source of its creation. Because MiNiFi is designed to run directly at the source, special importance is given to a small footprint and low resource consumption. MiNiFi is available as Java and C++ agents, which are ~50MB and 3.2MB in size respectively.

3. What is the role of Apache NiFi in Big Data Ecosystem?

The main roles Apache NiFi is suited for in the Big Data ecosystem are:

  • Data acquisition and delivery.
  • Transformation of data.
  • Routing data from different sources to different destinations.
  • Event processing.
  • End to end provenance.
  • Edge intelligence and bi-directional communication.

4. What are the main features of NiFi?

The main features of Apache NiFi are:

  • Highly configurable: Apache NiFi is flexible in its configuration and lets us decide which trade-offs a flow should make. Some of the possibilities are:

    • Loss tolerant vs Guaranteed delivery
    • Low latency vs High throughput
    • Dynamic prioritization
    • Flow can be modified at runtime
    • Back pressure
  • Designed for extension: We can build our own processors, controllers etc.
  • Secure

    • SSL, SSH, HTTPS, encrypted content etc.
    • Multi-tenant authorisation and internal authorisation/policy management
  • MiNiFi Subproject: Apache MiNiFi is a subproject of NiFi which reduces the footprint to approx. 40 MB only and is very useful when we need to run data pipelines in low resource environments.

5. What is Apache NiFi used for?

  • Reliable and secure transfer of data between different systems.
  • Delivery of data from source to different destinations and platforms.
  • Enrichment and preparation of data:

    • Conversion between formats.
    • Extraction/Parsing.
    • Routing decisions.

6. What is a flowfile?

FlowFiles are the heart of NiFi and its dataflows. A FlowFile is a data record that consists of a pointer to its content and a set of attributes that describe the content. The content pointer refers to the actual data being handled, and the attributes are key-value pairs that act as metadata for the flowfile. Some of the attributes of a flowfile are filename, UUID, MIME Type etc.

7. What are the components of a flowfile?

A FlowFile is made up of two parts:

  1. Content: The content is the stream of bytes being processed in the dataflow and transported from source to destination. Keep in mind that the flowfile itself does not contain the data; it holds a pointer to the content. The actual content lives in the Content Repository of NiFi.
  2. Attributes: The attributes are key-value pairs that are associated with the data and act as the metadata for the flowfile. They are generally used to store values that provide context for the data. Some examples of attributes are filename, UUID, MIME Type, flowfile creation time etc.
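The split between content and attributes can be sketched in a few lines of Python. This is only an illustrative model, not NiFi's implementation: `ContentClaim`, `ContentStore` and `create_flowfile` are hypothetical names, and the in-memory dictionary merely stands in for the Content Repository.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ContentClaim:
    """Hypothetical pointer into a content store: which blob and which byte range."""
    claim_id: str
    offset: int
    length: int


@dataclass
class FlowFile:
    """Toy flowfile: metadata attributes plus a pointer to the content, never the bytes."""
    claim: ContentClaim
    attributes: dict = field(default_factory=dict)


class ContentStore:
    """In-memory stand-in for the Content Repository (an assumption of this sketch)."""

    def __init__(self):
        self._blobs = {}

    def write(self, data: bytes) -> ContentClaim:
        claim_id = str(uuid.uuid4())
        self._blobs[claim_id] = data
        return ContentClaim(claim_id, 0, len(data))

    def read(self, claim: ContentClaim) -> bytes:
        blob = self._blobs[claim.claim_id]
        return blob[claim.offset:claim.offset + claim.length]


def create_flowfile(store: ContentStore, data: bytes, filename: str, mime_type: str) -> FlowFile:
    # The bytes go to the store; the flowfile keeps only the claim and the metadata.
    claim = store.write(data)
    attributes = {"filename": filename, "uuid": str(uuid.uuid4()), "mime.type": mime_type}
    return FlowFile(claim, attributes)
```

Passing such a flowfile from one step to the next copies only the small claim and the attribute map, which is the point of keeping the content behind a pointer.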

8. What is a processor?

NiFi processors are the building blocks and the most commonly used components in NiFi. Processors are the blocks we drag and drop onto the canvas, and dataflows are made up of multiple processors. A processor can bring data into the system, like GetHTTP, GetFile, ConsumeKafka etc., or perform some kind of data transformation or enrichment, for instance SplitJson, ConvertAvroToORC, ReplaceText, ExecuteScript etc.

9. Do NiFi and Kafka overlap in functionality?

This is a very common question. Apache NiFi and Kafka are actually very complementary solutions. A Kafka broker provides very low latency, especially when a large number of consumers pull from the same topic. Kafka provides data pipelines with low latency, but it is not designed to solve dataflow challenges such as data prioritization and enrichment. That is what Apache NiFi is designed for: it helps in designing dataflow pipelines that can perform data prioritization and other transformations when moving data from one system to another.

Furthermore, Kafka prefers smaller messages, in the KB to MB range, while NiFi handles data of arbitrary size, which can go up to GB per file or even more.

Apache NiFi complements Apache Kafka by handling the dataflow problems that Kafka is not designed to solve.

10. While configuring a processor, what is the language syntax or formulas used?

NiFi has a concept called Expression Language, which is supported on a per-property basis: the developer of a processor chooses whether each property supports Expression Language or not.
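To make the per-property idea concrete, here is a toy resolver in Python. It is a sketch under stated assumptions, not NiFi's evaluator: it handles only plain `${attribute}` references and the `toUpper()` and `toLower()` functions, the `supports_expression_language` flag is a hypothetical stand-in for the choice the processor developer makes, and resolving a missing attribute to an empty string is a simplification chosen for this example.

```python
import re

# Matches ${name} and ${name:toUpper()} / ${name:toLower()} only.
_REFERENCE = re.compile(r"\$\{([A-Za-z0-9_.]+)(?::(toUpper|toLower)\(\))?\}")

_FUNCTIONS = {"toUpper": str.upper, "toLower": str.lower}


def evaluate(value: str, attributes: dict, supports_expression_language: bool) -> str:
    """Resolve attribute references in a property value, if the property allows it."""
    if not supports_expression_language:
        # The property is taken literally when the developer did not enable support.
        return value

    def substitute(match):
        name, function = match.group(1), match.group(2)
        resolved = attributes.get(name, "")  # simplification: missing means empty
        return _FUNCTIONS[function](resolved) if function else resolved

    return _REFERENCE.sub(substitute, value)
```

For a flowfile whose `filename` attribute is `report.csv`, a property value of `/out/${filename}` resolves to `/out/report.csv` when support is enabled and is left untouched when it is not.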

11. Is there a programming language that Apache NiFi supports?

Apache NiFi is implemented in the Java programming language and allows extensions to be implemented in Java. In addition, NiFi supports processors that execute scripts written in Groovy, Jython and several other scripting languages.

12. Can we schedule the flow to auto run like one would with a coordinator?

By default, processors run continuously, because Apache NiFi is designed around the principle of continuous streaming. The exception is when we choose to run a processor only on a schedule, for example hourly or daily. By design, however, Apache NiFi is not job oriented: once we start a processor, it keeps running.

13. How can we decide between NiFi vs Flume vs Sqoop?

NiFi supports all the use cases that Flume supports and also has a Flume processor out of the box.

NiFi also supports some capabilities similar to Sqoop. One example is the GenerateTableFetch processor, which does incremental fetch and parallel fetch against source table partitions.

Ultimately, what we want to look at is whether we are solving a specific or singular use case. If so, then any one of the tools will work. NiFi’s benefits really shine when we consider multiple use cases being handled at once, along with critical flow management features like interactive, real-time command and control with full data provenance.

14. What happens to data if NiFi goes down?

NiFi stores the data in the repository as it is traversing through the system. There are 3 key repositories:

  1. The flowfile repository.
  2. The content repository.
  3. The provenance repository.

As a processor writes data to a flowfile, that data is streamed directly to the content repository. When the processor finishes, it commits the session. The commit triggers an update of the provenance repository with the events that occurred for that processor, and then the flowfile repository is updated to record where in the flow the file is. Finally, the flowfile can be moved to the next queue in the flow. This way, if NiFi goes down at any point, it can resume where it left off.

This glosses over one detail: by default, when the repositories are updated, the data is written to the repository, but the write is often cached by the OS. If the OS fails along with NiFi, this cached data might be lost. To avoid the caching, we can configure the repositories in the nifi.properties file to always sync to disk, although this can significantly hinder performance. If only NiFi goes down, the data is not affected, because the OS is still responsible for flushing the cached data to disk.
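The difference between a cached write and a synced write can be illustrated with a small Python sketch. It is not NiFi code: `append_record` and its `always_sync` flag are hypothetical names that only mirror the idea of an always-sync setting, and the sketch shows the general operating system behaviour rather than how NiFi's repositories are implemented.

```python
import os


def append_record(path: str, record: bytes, always_sync: bool) -> None:
    """Append a record to a journal file, optionally forcing it onto the disk."""
    with open(path, "ab") as handle:
        handle.write(record)
        # flush() hands the bytes to the OS, where they may sit in the page cache.
        handle.flush()
        if always_sync:
            # fsync() asks the OS to push the cached bytes to the storage device
            # before returning, which is safer but slower.
            os.fsync(handle.fileno())
```

With `always_sync=False` the record survives a crash of the writing process, because the OS still owns the cached bytes, but it can be lost if the machine itself goes down before the cache is flushed. With `always_sync=True` every append waits for the device, which is where the performance cost comes from.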

15. If no prioritizers are set in a processor, what prioritization scheme is used?

The default prioritization scheme is said to be undefined, and it may change from time to time. If no prioritizers are set, the data is sorted based on the FlowFile’s Content Claim. This provides the most efficient reading of the data and the highest throughput. Changing the default to First In First Out has been discussed, but right now it is based on what gives the best performance.
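What a prioritizer changes can be shown with a toy queue in Python. This is a sketch with hypothetical names, not NiFi's queue implementation: flowfiles are plain dictionaries, `queued_at` is an assumed enqueue counter, and the `priority` attribute with "lower number first" is an assumption made for the second example. With no prioritizer the sketch simply keeps the stored order, which stands in for "undefined".

```python
from typing import Callable, Optional


def poll_order(queued: list, prioritizer: Optional[Callable] = None) -> list:
    """Return the flowfiles in the order this toy queue would hand them out."""
    if prioritizer is None:
        # No prioritizer: whatever order the storage happens to give us.
        return list(queued)
    return sorted(queued, key=prioritizer)


def first_in_first_out(flowfile: dict) -> int:
    # Oldest enqueue counter first.
    return flowfile["queued_at"]


def by_priority_attribute(flowfile: dict) -> int:
    # Hypothetical rule: lower "priority" value first, missing attribute goes last.
    return int(flowfile["attributes"].get("priority", "9999"))
```

The same three queued flowfiles can therefore leave the queue in different orders depending on which key function is supplied, while the unprioritized case makes no promise at all.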

These are some of the most commonly asked interview questions about Apache NiFi. To read more about Apache NiFi, you can check the category Apache NiFi, and please do subscribe to the newsletter for more related articles.

About the author

Ramaninder Singh

Ramaninder is working as a Big Data Engineer and is an avid follower of the latest technologies, especially in the Big Data ecosystem. He graduated from Georg-August University, Germany, and holds an M.Sc in Applied Computer Science with a specialisation in Applied Systems Engineering and a minor in Business Informatics. He is also a Microsoft Certified Professional with more than 7 years of experience in Java, C#, Big Data, Web development and related technologies.
