Aiandreev.Data Lab.Hard Skills

Data Engineering

Pipeline Orchestration

Tools

Cron

r

https://ostechnix.com/a-beginners-guide-to-cron-jobs/

Airflow [J]

r

https://airflow.apache.org/docs/apache-airflow/stable/howto/index.html
Astronomer is a very useful resource: https://www.astronomer.io/guides/

Luigi

r

https://luigi.readthedocs.io/en/stable/

Prefect

Rundeck

Dagster

r

https://docs.dagster.io/guides

Metaflow

Cloud

AWS Step Functions

Google Cloud Composer

Azure Data Factory

Data Storage

RDBMS

Classic

r

https://www.sql-ex.ru/

MS SQL Server [M]

r

Database course taught at UrFU in 2022: https://www.youtube.com/playlist?list=PLuYsCpx95Allwadi6NMeUYjGg31g7UsPP

PostgreSQL [M]

MySQL [J]

Oracle

Distributed

Cloud

AWS Redshift

MS Synapse

Google BigQuery

Snowflake

YDB

ArenaData

On premise

Citus

Apache Hive

ClickHouse

Non-RDBMS

Apache Hadoop

r

As a user, two commands are really all you need:
hdfs dfs -get remote local
hdfs dfs -put local remote
HDFS architecture (no need to study the Java API): https://stepik.org/lesson/15482/step/1?unit=4233
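The two HDFS transfers can also be scripted. Below is a minimal Python sketch, assuming the `hdfs` CLI is installed and on PATH and that file-system operations go through its `dfs` subcommand. The helper names (`hdfs_get_cmd`, `hdfs_put_cmd`, `run_hdfs`) and the example paths are hypothetical, introduced only for illustration.

```python
import subprocess


def hdfs_get_cmd(remote_path, local_path):
    """Build the argv that copies a file from HDFS to the local disk."""
    return ["hdfs", "dfs", "-get", remote_path, local_path]


def hdfs_put_cmd(local_path, remote_path):
    """Build the argv that copies a local file into HDFS."""
    return ["hdfs", "dfs", "-put", local_path, remote_path]


def run_hdfs(cmd):
    """Execute a prepared command; raises CalledProcessError on failure.

    Needs a working Hadoop client, so it is not called in this sketch.
    """
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only print the commands; the paths are placeholders.
    print(" ".join(hdfs_get_cmd("/data/events.csv", "events.csv")))
    print(" ".join(hdfs_put_cmd("events.csv", "/data/events.csv")))
```

Building the argument list separately from running it keeps the command easy to inspect and log before anything touches the cluster.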

MongoDB

MS Cosmos DB

Apache Ignite

Apache Cassandra

AWS DynamoDB

Google Firebase

MariaDB

Neo4j

ArangoDB

Storage

Cloud

AWS S3

Google Drive

MS Blob Storage

Minio S3

r

Installing MinIO: https://docs.min.io/docs/minio-docker-quickstart-guide.html

MS Sharepoint

MS OneDrive

Ya Disk

Sber Disk

Protocols

FTP

SFTP

SCP

r

User-level usage.
Copy from the local machine to the remote machine:
scp local_path username@host:remote_path
Copy from the remote machine to the local machine:
scp username@host:remote_path local_path
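The two copy directions differ only in which argument carries the `username@host:` prefix. Here is a minimal Python sketch of that, assuming an `scp` client is installed. The function names and the example user, host and paths are hypothetical, and `run` defaults to a dry run so nothing is copied unless explicitly requested.

```python
import shlex
import subprocess


def scp_upload_cmd(local_path, username, host, remote_path):
    """Argv for copying a local file to a remote machine."""
    return ["scp", local_path, f"{username}@{host}:{remote_path}"]


def scp_download_cmd(username, host, remote_path, local_path):
    """Argv for copying a remote file to the local machine."""
    return ["scp", f"{username}@{host}:{remote_path}", local_path]


def run(cmd, dry_run=True):
    """Return the shell-quoted command; execute it only when dry_run is False."""
    if not dry_run:
        subprocess.run(cmd, check=True)
    return shlex.join(cmd)


if __name__ == "__main__":
    # Placeholder user, host and paths; printed, not executed.
    print(run(scp_upload_cmd("report.csv", "alice", "example.com", "/tmp/report.csv")))
    print(run(scp_download_cmd("alice", "example.com", "/tmp/report.csv", "report.csv")))
```

Passing the command as a list rather than a single string means local paths containing spaces need no manual quoting.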

S3

WebDAV

HTTP

ETL

Languages

Python

requests

sqlalchemy [J]

pandas [J]

psycopg2 / 3

bonobo

BeautifulSoup

lxml

dask

r

https://docs.dask.org/en/stable/10-minutes-to-dask.html

Scrapy

ConnectorX

Scala

Cats

Java

r

A so-so course; suitable only if you know nothing at all about Java: https://stepik.org/course/497/syllabus

R

r

https://stepik.org/course/497/syllabus

Go

Tools

Pentaho DI [J]

Talend

MS Data Factory

MS SSIS

Databricks

Informatica

AWS Glue

AWS DMS

MQ

Rabbit

Kafka

AWS SQS

AWS MQ

IBM MQ

Software Engineering

Algorithms & Data Structures [J]

Deploy & Code Maintenance

CI / CD

GitLab CI/CD

r

https://docs.gitlab.com/ee/ci/

GitHub Actions

r

https://docs.github.com/en/actions/learn-github-actions

Jenkins

TeamCity

Git

GitHub [J]

Bitbucket [J]

GitLab [J]

Code quality

Linters

Python

Flake8

wemake

mypy

pycodestyle

Testing

Python

pytest

pytest-cov

hypothesis

unittest

Formatters

Python

black

pre-commit

MLOps

Data tracking & Quality

DVC

r

https://dvc.org/doc/start

CML

pandera

pydantic

Experiment tracking

ClearML

r

https://clear.ml/docs/latest/docs/getting_started/ds/ds_first_steps

MLflow

r

https://www.mlflow.org/docs/latest/quickstart.html

Serving

bentoml

flask

FastAPI

r

https://fastapi.tiangolo.com/tutorial/first-steps/

Languages

Scala

Python [J]

Data Science

Statistics

Python

SciPy

Pingouin

statsmodels

EDA

pandas-profiling

sweetviz

Machine Learning

Classic ML

Scikit-Learn

Clustering

KMeans

DBSCAN

Agglomerative

Linear models

Logistic Regression

Ridge

Lasso

Vowpal Wabbit

Gradient Boosting

LightGBM

CatBoost

XGBoost

Deep learning

TensorFlow

PyTorch

PyTorch Lightning

Keras

JAX

NLP

NLTK

Natasha

Razdel

Textblob

spaCy

pymorphy2

Embeddings

HuggingFace

Faiss

Qdrant

Gensim

fasttext

Computer Vision

Image manipulation

OpenCV

Pillow

Scikit-Image

Detection / Segmentation

detectron2

segmentation-models

pytorch-toolbelt

Mahotas

SimpleITK

Pytesseract

PyTorchCV

timm

Recommendation Systems

Collaborative Filtering

Reinforcement learning

Bandit

Monte Carlo

Dynamic programming

Temporal difference

Data Mining

Languages

Python

NumPy [J]

numba

Pandas [J]

polars

Matplotlib [J]

Seaborn [J]

Plotly [J]

R

Scala

Java

SQL [M]

Tools

MS Excel [M]

BI

MS Power BI

Qlik Sense [M]

Qlik View [M]

Streamlit

Tableau

Spotfire

Redash

Metabase

Ya Datalens

Apache Superset

Visiology

Fine BI

IT Infrastructure

Hosting & Serverless computing

Cloud

Linux

Windows

Azure Windows Server

AWS EC2

AWS Lightsail

Ya.Cloud

SberCloud

Selectel

Containers

Docker

docker-compose

Container hosting

AWS ECS

Azure Databricks

Kubernetes

AWS Lightsail

On premise

Linux [J]

Windows [J]

Authorization

MS Active Directory

LDAP [J]

AWS IAM

Load Balancer

Azure Traffic Manager

Nginx

AWS Elastic Load Balancing

Citrix ADC

HAProxy

Kubernetes

DNS

LAN

WAN

Server

AWS Lightsail

AWS Route 53

Google Public DNS

Yandex DNS

Domain

Registration

Routing delegation

Structure

Zones

DNS Records

A

AAAA

CNAME

MX

NS

PTR

SSL Certificates

DV

EV

OV

Wildcard

Multi domain

Monitoring

Zabbix

Grafana

Kibana

Prometheus

Mail

Google Gmail

Yandex Mail

AWS Workmail

IaC

Ansible [J]

Terraform

AWS CloudFormation

Agile & Communication

Task trackers

Jira [J]

Trello

Knowledge sharing

Confluence [M]

Communication

MS Teams [M]

MS Outlook [J]

Corporate Social Network

Yammer [J]

Diagrams

Draw.io [J]

Gliffy

MS Visio

Miro [J]

PlantUML

Mindomo [J]

Figma

Presentations

MS PowerPoint [J]

Google Slides

LaTeX Beamer