Study Questions pt.1

What is the role of a data engineer?

  • Design, build, and maintain the infrastructure and architecture for collecting, storing, and processing large sets of data.

Explain the ETL process.

  • ETL stands for Extract, Transform, Load. It is a data integration process that involves extracting data from various sources, transforming it to fit operational needs, and loading it into a target database or data warehouse.
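
  A minimal ETL sketch in Python, assuming a hypothetical orders.csv source and a local SQLite database standing in for the target warehouse (the file name, columns, and table are illustrative, not prescribed):

      import sqlite3
      import pandas as pd

      def extract(path):
          # Extract: pull raw records from a source system (here, a CSV export).
          return pd.read_csv(path)

      def transform(df):
          # Transform: clean and reshape the data to fit the target schema.
          df = df.dropna(subset=["order_id"])
          df["order_date"] = pd.to_datetime(df["order_date"])
          df["total"] = df["quantity"] * df["unit_price"]
          return df[["order_id", "order_date", "total"]]

      def load(df, conn):
          # Load: write the transformed records into the target table.
          df.to_sql("orders_clean", conn, if_exists="append", index=False)

      with sqlite3.connect("warehouse.db") as conn:
          load(transform(extract("orders.csv")), conn)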

What are the differences between OLTP and OLAP?

  • OLTP (Online Transaction Processing) is designed for managing transactional data with a focus on insert, update, and delete operations. OLAP (Online Analytical Processing) is designed for analyzing large volumes of data and is optimized for read-heavy operations like querying and reporting.
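
  A small illustration of the two access patterns, using SQLite purely as a stand-in for both kinds of system:

      import sqlite3

      conn = sqlite3.connect(":memory:")
      conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

      # OLTP-style workload: many small writes, each touching a single row.
      conn.execute("INSERT INTO sales (region, amount) VALUES (?, ?)", ("EU", 120.0))
      conn.execute("UPDATE sales SET amount = ? WHERE id = ?", (125.0, 1))
      conn.commit()

      # OLAP-style workload: a read-heavy aggregate scanning many rows for reporting.
      for region, total in conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
          print(region, total)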

What are the key components of a data pipeline?

  • Key components include data sources, data ingestion, data storage, data processing, data analysis, and data visualization.
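
  A toy pipeline tying the components together (the source data, file name, and aggregation are invented for illustration):

      import json
      from collections import Counter
      from pathlib import Path

      def ingest():
          # Ingestion: in practice this would read from an API, queue, or database.
          return [{"user": "a", "event": "click"},
                  {"user": "b", "event": "view"},
                  {"user": "a", "event": "view"}]

      def store(events, path):
          # Storage: persist raw events (here as newline-delimited JSON).
          path.write_text("\n".join(json.dumps(e) for e in events))

      def process(path):
          # Processing/analysis: aggregate events by type for downstream reporting.
          events = [json.loads(line) for line in path.read_text().splitlines()]
          return Counter(e["event"] for e in events)

      store(ingest(), Path("events.jsonl"))
      print(process(Path("events.jsonl")))  # Counter({'view': 2, 'click': 1})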

What are the benefits of using a distributed file system like HDFS?

  • HDFS (Hadoop Distributed File System) provides high-throughput access to large datasets, fault tolerance, scalability, and the ability to store and process data across multiple machines.
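
  A hedged sketch of reading and writing files on HDFS through PyArrow's filesystem interface (the NameNode host/port and paths are placeholders, and a configured Hadoop client is assumed):

      from pyarrow import fs

      # Connect to the cluster's NameNode; HDFS splits large files into blocks
      # replicated across DataNodes, which provides fault tolerance and throughput.
      hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

      with hdfs.open_output_stream("/data/raw/example.txt") as f:
          f.write(b"hello from hdfs\n")

      with hdfs.open_input_stream("/data/raw/example.txt") as f:
          print(f.read())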

Explain the concept of data lake and its use cases.

  • A data lake is a centralized repository that stores structured, semi-structured, and unstructured data at any scale, typically built on storage such as HDFS or AWS S3. Use cases include big data analytics, machine learning, and data archiving.
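
  A hedged sketch of landing data in an S3-backed lake as partitioned Parquet (the bucket, prefix, and columns are invented; pandas needs the pyarrow and s3fs extras plus AWS credentials):

      import pandas as pd

      df = pd.DataFrame({
          "user_id": [1, 2],
          "event": ["click", "view"],
          "ts": ["2024-01-01", "2024-01-01"],
      })

      # Parquet is a common columnar format for lakes; partitioning by date keeps
      # the lake queryable by engines such as Spark, Athena, or Presto.
      df.to_parquet("s3://my-data-lake/events/dt=2024-01-01/part-0.parquet", index=False)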

What are some common data ingestion tools?

  • Common tools include Apache Kafka for streaming ingestion, along with Apache NiFi, AWS Kinesis, Apache Flume, and Sqoop for batch ingestion from relational databases.
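
  A hedged sketch of streaming ingestion with the kafka-python client (the broker address and topic name are placeholders):

      import json
      from kafka import KafkaProducer

      producer = KafkaProducer(
          bootstrap_servers="localhost:9092",
          value_serializer=lambda v: json.dumps(v).encode("utf-8"),
      )

      # Each record is published to the topic; downstream consumers (e.g. a Spark
      # job or a warehouse loader) read and process the stream.
      producer.send("events", {"user_id": 42, "event": "page_view"})
      producer.flush()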

What is Apache Spark and why is it used in data engineering?

  • Apache Spark is a unified analytics engine for large-scale data processing, known for its speed, ease of use, and support for advanced analytics like machine learning and graph processing.
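
  A hedged sketch of a basic PySpark job (the input path and column names are placeholders; requires pyspark to be installed):

      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.appName("example-job").getOrCreate()

      # Read a large dataset in parallel across the cluster, then run a distributed
      # aggregation; Spark keeps intermediate data in memory where possible.
      df = spark.read.csv("s3://my-data-lake/events/", header=True, inferSchema=True)
      df.groupBy("event").agg(F.count("*").alias("n")).orderBy(F.desc("n")).show()

      spark.stop()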

What is a data warehouse and how does it differ from a data lake?

  • A data warehouse is a centralized repository optimized for fast query performance and reporting; examples include Hive, Snowflake, and Amazon Redshift. It typically stores structured data with a predefined schema, whereas a data lake can store both structured and unstructured data.
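
  A hedged sketch of loading curated Parquet files from a lake into a Redshift warehouse table with a COPY statement (the connection details, table, bucket, and IAM role are all placeholders):

      import psycopg2

      conn = psycopg2.connect(
          host="my-cluster.example.redshift.amazonaws.com",
          port=5439, dbname="analytics", user="loader", password="example",
      )
      with conn, conn.cursor() as cur:
          # The warehouse holds structured, schema-on-write tables optimized for
          # fast reporting queries, unlike the schema-on-read files in the lake.
          cur.execute("""
              COPY analytics.events
              FROM 's3://my-data-lake/events/curated/'
              IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-loader'
              FORMAT AS PARQUET;
          """)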