Introduction

Data Stack

I graduated with a Bachelor of Engineering in Computer Science from Bach Khoa University (HCMUT) in November 2022. Before graduation, I interned as a Python/Odoo developer for five months. That was my first opportunity to work with real business data: my main task was to build a sales overview dashboard for the director and stakeholders. At the time, my knowledge of data and DevOps was vague. I didn’t know what a data pipeline was, how to extract and transform data from a source, or even which fields, colors, and so on to use. I believe most students face similar challenges when starting their careers.

Today, at 24, I’m writing about data pipelines and giving an overview of the data stack for data engineers. I would like to dedicate this post to myself three years ago.

What is a data pipeline?

Assume you have data from many sources (e.g., APIs, databases) and you want to collect, summarize, and store it in a single location. After that, you want to use it for analysis, visualization, machine learning, or other purposes. This entire process is called a data pipeline.

There are two main types of data pipelines:

Stream processing pipelines

A data stream is a continuous, incremental sequence of small data packets, usually representing a series of events occurring over time. Stream processing pipelines handle data in real time, processing each record as it arrives. This is useful for applications that need real-time analytics, such as monitoring system logs or processing transactions.
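To make the idea concrete, here is a minimal pure-Python sketch of stream processing: events are consumed one at a time, as they arrive, rather than collected first. The generator and field names are made up for illustration; a real pipeline would read from something like a Kafka topic.

```python
import time
from typing import Iterator


def event_stream() -> Iterator[dict]:
    """Simulate a continuous stream of small events (e.g., log lines)."""
    for i in range(5):
        yield {"event_id": i, "value": i * 10}


def process_stream(events: Iterator[dict]) -> list[dict]:
    """Handle each event individually, as soon as it arrives."""
    processed = []
    for event in events:
        # A real pipeline would push each result to a sink (topic, DB, ...).
        event["processed_at"] = time.time()
        processed.append(event)
    return processed


results = process_stream(event_stream())
print(len(results))  # 5 events, each handled one at a time
```

The key point is that `process_stream` never waits for the whole dataset; it reacts to each event independently.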

Batch processing pipelines

Batch processing data pipelines process and store data in large volumes or batches. They are suitable for occasional high-volume tasks like monthly accounting or periodic reporting. Batch processing allows for handling significant amounts of data at once, making it ideal for tasks that do not require immediate real-time processing.
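By contrast, a batch pipeline groups records and processes each group as a unit. The sketch below, with made-up data, batches a sequence of records and aggregates each batch at once, the way a periodic report would:

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(records: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group records into fixed-size batches before processing."""
    it = iter(records)
    while batch := list(islice(it, batch_size)):
        yield batch


def process_batch(batch: List[int]) -> int:
    """Process a whole batch at once, e.g. a periodic aggregation."""
    return sum(batch)


totals = [process_batch(b) for b in batched(range(10), batch_size=4)]
print(totals)  # [6, 22, 17]
```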

Basically, a full data pipeline has the following steps:

  • Data ingestion: Collecting raw data from various sources such as databases, APIs, or streaming services.
  • ETL pipeline: Extracting, Transforming, and Loading data to prepare it for analysis. This involves cleaning and transforming the data into a usable format.
  • Data Analysis: Analyzing the processed data to extract insights and inform decision-making.
  • Data Visualization/Machine Learning: Creating visual representations of the data to communicate findings effectively or applying machine learning algorithms to build predictive models.
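The ingestion and ETL steps above can be sketched in a few lines of Python. Everything here is a stand-in: `extract` fakes an API read with a hard-coded JSON string, and a plain list plays the role of the warehouse.

```python
import json


def extract() -> list[dict]:
    """Ingest raw data; a hard-coded stand-in for an API or database read."""
    raw = '[{"name": " Alice ", "amount": "120"}, {"name": "bob", "amount": "80"}]'
    return json.loads(raw)


def transform(rows: list[dict]) -> list[dict]:
    """Clean and normalize the raw rows into a usable format."""
    return [
        {"name": row["name"].strip().title(), "amount": int(row["amount"])}
        for row in rows
    ]


def load(rows: list[dict], store: list[dict]) -> None:
    """Write the cleaned rows to a single destination."""
    store.extend(rows)


warehouse: list[dict] = []
load(transform(extract()), warehouse)
print(warehouse[0])  # {'name': 'Alice', 'amount': 120}
```

Real pipelines differ mostly in scale and tooling; the extract → transform → load shape stays the same.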

Data stack

To handle the tasks in a data pipeline, data engineers and data analysts need to learn some essential tools and technologies, collectively known as the data stack. Based on my experience, I have grouped them to make research easier. Please note that you don’t need to be familiar with every tool in the stack: master one tool in each group first, and you can then pick up the others more quickly.

Okay, let’s get started.

1. Collection + Integration + Orchestration

  • Dagster: Orchestrator for data pipelines.
  • Airbyte: Data integration for moving data from various sources into your systems.
  • Apache Spark: Data processing engine that can also handle data ingestion when integrated with sources.
  • Apache Airflow: Workflow orchestration to manage and schedule data pipelines.
  • Kafka: Messaging system primarily for building real-time data pipelines and streaming apps.
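Orchestrators like Airflow and Dagster model a pipeline as a DAG (directed acyclic graph) of tasks and run each task only after its dependencies finish. This toy resolver, using Python’s standard-library `graphlib` with made-up task names, shows the core idea without any orchestrator installed:

```python
from graphlib import TopologicalSorter

# Toy pipeline DAG: each task maps to the set of tasks it depends on.
dag = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"aggregate"},
}

# An orchestrator's scheduler does essentially this: find a valid run order.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['ingest', 'clean', 'aggregate', 'report']
```

Real orchestrators add scheduling, retries, and monitoring on top of this dependency-ordering core.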

2. Transformation

  • dbt (data build tool): Focuses on transforming data inside your data warehouse.
  • Databricks: Provides a unified analytics platform (combining data engineering, data science, and analytics), including data transformation capabilities.
  • Flink: Stream processing framework, but also suitable for stateful computations over data streams.

3. Database

  • PostgreSQL: Relational database system.
  • Oracle: Comprehensive relational database system.
  • Redis: In-memory data structure store, used as a database, cache, and message broker.
  • MongoDB: NoSQL database for handling document-oriented storage.

4. Data Warehouse

  • Snowflake: Cloud data platform and data warehouse.
  • ClickHouse: OLAP database management system for online analytical processing.
  • Databricks: Listed again here because it can also serve as a big data processing platform with storage capabilities.
  • Google BigQuery: Serverless, highly scalable, and cost-effective cloud data warehouse.
  • Redshift: Data warehouse product that forms part of AWS, Amazon’s cloud computing platform.

5. Data Visualization

  • PowerBI
  • Tableau
  • Looker
  • Sisense

6. Data Catalog

  • Datahub: Data discovery and metadata platform.
  • Amundsen: Data discovery and metadata engine.

7. Data Quality Assurance

  • Great Expectations: Tool to help teams eliminate pipeline debt, through data testing, documentation, and profiling.
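The core idea behind data testing is simple: declare what valid data looks like, then report every row that violates an expectation. This is a hand-rolled sketch of that pattern, not the Great Expectations API; field names and rules are made up.

```python
def check_expectations(rows: list[dict]) -> list[str]:
    """Return a description of every failed expectation in a batch of rows."""
    failures = []
    for i, row in enumerate(rows):
        if row.get("amount") is None:
            failures.append(f"row {i}: amount must not be null")
        elif row["amount"] < 0:
            failures.append(f"row {i}: amount must be >= 0")
    return failures


rows = [{"amount": 120}, {"amount": -5}, {"amount": None}]
print(check_expectations(rows))
# ['row 1: amount must be >= 0', 'row 2: amount must not be null']
```

Tools like Great Expectations generalize this with declarative suites, profiling, and auto-generated documentation.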

8. AI/Machine Learning

  • LangChain: Framework for building applications on top of large language models.
  • OpenAI: AI research and deployment company.
  • Hugging Face: Provides a platform for building, training, and deploying machine learning models, particularly focused on natural language processing.
  • MLflow: Platform to manage the machine learning lifecycle, including experimentation, reproducibility, and deployment.
  • PyTorch: Machine learning framework that accelerates the path from research prototyping to production deployment.
  • TensorFlow: An end-to-end open-source platform for machine learning.

In addition to the data engineering technologies above, you also need to learn some DevOps to handle tasks involving Docker and CI/CD.

9. Cloud Platform

  • AWS: Amazon Web Services.
  • GCP: Google Cloud Platform.
  • MS Azure: Microsoft Azure.

10. Deploying Data Pipeline in Production

  • Docker: Software platform that allows you to build, test, and deploy applications quickly.
  • Kubernetes: Open-source system for automating deployment, scaling, and management of containerized applications.

11. Automation and CI/CD for Data Pipelines

  • Version Control: Git, hosted on GitHub, Bitbucket, or GitLab
  • CI: Jenkins, GitLab CI/CD, GitHub Actions, Bitbucket Pipelines
  • CD: Terraform (IaC)
  • Containerization and Orchestration: Docker, Kubernetes
  • Monitoring and Logging: Prometheus, Grafana
  • Testing and Data Validation: Great Expectations, pytest/unittest, Selenium

Conclusion

The technologies I mentioned above are only a small part of the vast data world. At times, you can get lost in the enormous expanse of knowledge and feel overwhelmed. I hope this post helps you find what you need in the data world.

In the next posts, we will practice using some of these technologies in detail. See ya!