
Author: Lev Merkulov, 2 years ago


Data Lab.Hard Skills

This mind map outlines hard skills and tools for data work, grouped into agile practices and communication, IT infrastructure, data science, software engineering, and data engineering. Several nodes link to getting-started guides or carry short usage notes.


Agile & Communication

Presentations
LaTeX Beamer
Google Presentations
MS PowerPoint
Diagrams
Figma
Mindomo
PlantUML
Miro
MS Visio
Gliffy
Draw.io
Corporate Social Network
Yammer
Communication
MS Outlook
MS Teams
Knowledge sharing
Confluence
Task trackers
Trello
Jira

IT Infrastructure

IaC
AWS CloudFormation
Terraform
Ansible
Mail
AWS Workmail
Yandex Mail
Google Gmail
Monitoring
Prometheus
Kibana
Grafana
Zabbix
DNS
WAN

SSL Certificates

Multi domain

Wildcard

OV

EV

DV

DNS Records

PTR

NS

MX

CNAME

AAAA

A

Zones

Domain

Structure

Routing delegation

Registration

Server

Yandex DNS

Google Public DNS

AWS Route 53

LAN
Load Balancer
HAProxy
Citrix ADC
AWS Elastic Load Balancing
Nginx
Azure Traffic Manager
Authorization
AWS IAM
LDAP
MS Active Directory
Hosting & Serverless computing

Containers

Container hosting

Kubernetes

Azure Databricks

AWS ECS

Docker

docker-compose

Windows

Selectel

SberCloud

Yandex Cloud

AWS Lightsail

AWS EC2

Azure Windows Server

Linux

Data Science

BI
FineBI
Visiology
Apache Superset
Yandex DataLens
Metabase
Redash
Spotfire
Tableau
Streamlit
QlikView
Qlik Sense
MS Power BI
Data Mining

MS Excel

SQL

Plotly

Matplotlib

Seaborn

polars

Pandas

NumPy

numba

Machine Learning
Reinforcement learning

Temporal difference

Dynamic programming

Monte Carlo

Bandit

Recommendation Systems

Collaborative Filtering

Computer Vision

PyTorchCV

timm

Pytesseract

SimpleITK

Mahotas

Detection / Segmentation

pytorch-toolbelt

segmentation-models

detectron2

Image manipulation

Scikit-Image

Pillow

OpenCV

NLP

Embeddings

fasttext

Gensim

Qdrant

Faiss

HuggingFace

pymorphy2

spaCy

Textblob

Razdel

Natasha

NLTK

Deep learning

Jax

Keras

PyTorch

PyTorch Lightning

TensorFlow

Classic ML

Gradient Boosting

XGBoost

CatBoost

LightGBM

Vowpal Wabbit

Scikit-Learn

Linear models

Lasso

Ridge

Logistic Regression

Clustering

Agglomerative

DBSCAN

KMeans

EDA
sweetviz
pandas-profiling
Statistics

statsmodels

Pingouin

SciPy

Software Engineering

Deployment & Code Maintenance
MLOps

Serving

FastAPI

https://fastapi.tiangolo.com/tutorial/first-steps/

flask

bentoml

Experiment tracking

MLflow

https://www.mlflow.org/docs/latest/quickstart.html

ClearML

https://clear.ml/docs/latest/docs/getting_started/ds/ds_first_steps

Data tracking & Quality

pydantic

pandera

CML

DVC

https://dvc.org/doc/start

Code quality

pre-commit

Formatters

black

Testing

unittest

hypothesis

pytest

pytest-cov

Linters

pycodestyle

mypy

Flake8

wemake-python-styleguide

Git

GitLab

Bitbucket

GitHub

CI / CD

TeamCity

Jenkins

GitHub Actions

https://docs.github.com/en/actions/learn-github-actions

GitLab CI/CD

https://docs.gitlab.com/ee/ci/

Algorithms & Data Structures

Data Engineering

MQ
IBM MQ
Amazon MQ
AWS SQS
Kafka
RabbitMQ
ETL

AWS DMS

AWS Glue

Informatica

Databricks

MS SSIS

MS Data Factory

Talend

Pentaho DI

Languages

Go

R

https://stepik.org/course/497/syllabus

Java

A mediocre course; suitable only if you know nothing at all about Java

https://stepik.org/course/497/syllabus

Scala

Cats

Python

connectorX

Scrapy

dask

https://docs.dask.org/en/stable/10-minutes-to-dask.html

lxml

BeautifulSoup

bonobo

psycopg2 / 3

pandas

sqlalchemy

requests

Data Storage
Storage

Protocols

HTTP

WebDAV

S3

SCP

User-level usage


Copying from the local machine to the remote machine

scp local_path username@host:remote_path


Copying from the remote machine to the local machine

scp username@host:remote_path local_path
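
The two scp invocations can also be scripted. Below is a minimal Python sketch, assuming the OpenSSH `scp` client is installed and on PATH; the helper names (`scp_upload_cmd`, `scp_download_cmd`, `run`) and the sample values are hypothetical and not part of the original map.

```python
import subprocess
from typing import List


def scp_upload_cmd(local_path: str, username: str, host: str, remote_path: str) -> List[str]:
    """Build the argv for copying a local file to a remote machine."""
    return ["scp", local_path, f"{username}@{host}:{remote_path}"]


def scp_download_cmd(username: str, host: str, remote_path: str, local_path: str) -> List[str]:
    """Build the argv for copying a remote file to the local machine."""
    return ["scp", f"{username}@{host}:{remote_path}", local_path]


def run(cmd: List[str]) -> None:
    """Execute the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical values, for illustration only; nothing is copied here.
    print(scp_upload_cmd("report.csv", "alice", "example.com", "/tmp/report.csv"))
    print(scp_download_cmd("alice", "example.com", "/tmp/report.csv", "report.csv"))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces. Call `run(...)` on the resulting list to perform the copy.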

SFTP

FTP

Sber Disk

Yandex Disk

MS OneDrive

MS Sharepoint

MinIO S3

Installing MinIO

https://docs.min.io/docs/minio-docker-quickstart-guide.html

MS Blob Storage

Google Drive

AWS S3

Non-RDBMS

ArangoDB

Neo4j

MariaDB

Google Firebase

AWS DynamoDB

Apache Cassandra

Apache Ignite

MS Cosmos DB

MongoDB

Apache Hadoop

As a user, you really only need to know two commands

hdfs dfs -get remote local

hdfs dfs -put local remote
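
The same download and upload can be driven from Python through the `hdfs dfs` file-system shell. This is a minimal sketch, assuming a Hadoop client is installed and configured on the machine; the helper names (`hdfs_get_cmd`, `hdfs_put_cmd`, `run`) and the sample paths are hypothetical.

```python
import subprocess
from typing import List


def hdfs_get_cmd(remote: str, local: str) -> List[str]:
    """Build the argv for downloading a file from HDFS to the local disk."""
    return ["hdfs", "dfs", "-get", remote, local]


def hdfs_put_cmd(local: str, remote: str) -> List[str]:
    """Build the argv for uploading a local file to HDFS."""
    return ["hdfs", "dfs", "-put", local, remote]


def run(cmd: List[str]) -> None:
    """Execute the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical paths, for illustration only; nothing is transferred here.
    print(hdfs_get_cmd("/data/events.parquet", "./events.parquet"))
    print(hdfs_put_cmd("./events.parquet", "/data/events.parquet"))
```

Note the argument order: `-get` takes the HDFS path first, while `-put` takes the local path first.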


HDFS architecture (no need to study the Java API)

https://stepik.org/lesson/15482/step/1?unit=4233

RDBMS

Distributed

On premise

ClickHouse

Apache Hive

Citus

ArenaData

YDB

Snowflake

Google BigQuery

MS Synapse

AWS Redshift

Classic

https://www.sql-ex.ru/

Oracle

MySQL

PostgreSQL

MS SQL Server

Database course taught at UrFU (Ural Federal University) in 2022

https://www.youtube.com/playlist?list=PLuYsCpx95Allwadi6NMeUYjGg31g7UsPP

Pipeline Orchestration
Cloud

Azure Data Factory

Google Cloud Composer

AWS Step Functions

Tools

Metaflow

Dagster

https://docs.dagster.io/guides

Rundeck

Prefect

Luigi

https://luigi.readthedocs.io/en/stable/

Airflow

https://airflow.apache.org/docs/apache-airflow/stable/howto/index.html


Astronomer: a very useful resource!

https://www.astronomer.io/guides/

Cron

https://ostechnix.com/a-beginners-guide-to-cron-jobs/