Google Cloud Architecture

Key Features

Activate alerting modes for impacted assets

Integration with

Minimize false positives

Define ad-hoc response actions to treat events

Resume and analyze activity for deeper insight

Things are not instantaneous

For service reliability purposes, all parties should agree that they are an accurate measure of the user experience AND treat them as the primary driver for decision-making

Allowing applications to respond to user requests from the location that will provide the quickest response time.

Divided in Regions

Are changeable

Generally a pretty broad set of permissions altogether

Do not support IPv6

Network connectivity

Homogenous (G's - G's)

VPC

VPC Peering

Communication happens through private IP

Each GCP network admin pairs their network with the other

Outside same organization

Shared VPC

The GCP networks are connected through an ad hoc, privately generated network

Within the same organization

To create a dedicated, private connection between different GCP networks, whether or not they belong to the same organization node

Hybrid (G's - Own)

Type

There is no SLA for peering

Connection is established at the PoPs (GCP's Edge Points of Presence), where Google's network connects to the rest of the Internet via peering

It's an access service not only to the Google Cloud infrastructure, but to all of Google's services

Access: External IP

Carrier peering

If Direct is not an option, several providers offer to act as a bridge

Capacity: Depends on carrier

Establishing a direct connection with Google at a PoP

Requires: Connection in peering facility

Capacity: 10 Gbps / link

Interconnecting

Physical linking between networks' points of access (PoA) within a data center

Shared

If Dedicated is not an option, several providers offer to act as a bridge

Requires: Service provider

Capacity: 0.5 - 10 Gbps / connection

Dedicated

Install an owned router within a data center hosting a Google PoA and link them directly

Requires: Connection in colocation facility

Capacity: 10 - 100 Gbps (100 in beta) / link

VPN

Create an IPsec (secured) tunnel, with data encryption performed by gateways at each network's end

Access: Internal IP

Requires: On-prem Gateway installation

Capacity: 1.5 - 3 Gbps / tunnel (scalable)

Aim

To connect on-prem network with G's Network, by creating a direct link between owned resources and GC resources

Internal IP is generated through link-local BGP routes

Services

Load balancing

Global

TCP proxy

SSL proxy

HTTP(S) Load Balancing

3.5 A network endpoint group (NEG) is a configuration object that specifies a group of back-end endpoints or services

Internet NEG

Specified by "FQDN:[port]" or "IP:[port]"

A single, hybrid connectivity pointing to Traffic Director services outside G Cloud

Zonal NEG

1 or more endpoints (VM instances or containers)

Serverless NEG

Points to Cloud Run, App Engine, Cloud Functions services residing in the same region

Contains no end-points

3. Backend service

Priority is given to geographical vicinity and, if available, to session affinity

Runs health checks and routes the request based on the required load balancing criteria

2. Target proxy / URL Map

Receives the request from 1., and checks it against a URL map for the most appropriate backend service

1. Global forwarding rule

In case of HTTPS, the target proxy needs to hold a valid SSL certificate

IPv4/IPv6, scalable, requires no pre-warming, content-based, cross-regional
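
The three numbered steps above (forwarding rule, target proxy with URL map, backend service) can be sketched as a small routing model. This is only an illustration under stated assumptions: the class names, the longest-prefix matching, and the "same region first" rule standing in for geographical vicinity are all hypothetical simplifications, not Google's actual implementation or API.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    name: str
    region: str
    healthy: bool = True  # result of the backend service's health checks


@dataclass
class BackendService:
    name: str
    backends: list

    def pick(self, client_region: str) -> Backend:
        """Prefer a healthy backend in the client's region, else any healthy one."""
        healthy = [b for b in self.backends if b.healthy]
        if not healthy:
            raise RuntimeError(f"no healthy backend in service {self.name}")
        for backend in healthy:
            if backend.region == client_region:
                return backend
        return healthy[0]


@dataclass
class UrlMap:
    default_service: BackendService
    path_rules: dict = field(default_factory=dict)  # path prefix -> BackendService

    def match(self, path: str) -> BackendService:
        """Longest matching prefix wins (an assumption made for this sketch)."""
        best = None
        for prefix in self.path_rules:
            if path.startswith(prefix) and (best is None or len(prefix) > len(best)):
                best = prefix
        return self.path_rules[best] if best is not None else self.default_service


def handle_request(url_map: UrlMap, path: str, client_region: str) -> Backend:
    # Step 2: the target proxy consults the URL map.
    service = url_map.match(path)
    # Step 3: the backend service chooses a healthy backend.
    return service.pick(client_region)
```

For example, with a hypothetical "/video" rule pointing to a service with backends in two regions, a request from one of those regions lands on the local backend while it is healthy and falls back to the other region when it is not.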

Regional

Internal HTTP(S)

Network TCP/UDP

Internal TCP/UDP

Managed Instance Group

Health check management

Criteria are

Unhealthy threshold

How many failed attempts are decisive

Healthy threshold

How many successful attempts are decisive

Time-out

How long to wait for a response

Check interval

How often to check whether an instance is healthy.
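
How the four criteria above interact can be sketched with a small state tracker. This is a minimal model, assuming that thresholds count consecutive probe results and that a missing or late response counts as a failure. The class names and default values are hypothetical, and the check interval is only carried as configuration because scheduling is outside the sketch.

```python
from dataclasses import dataclass


@dataclass
class HealthCheckConfig:
    check_interval_s: float = 5.0  # how often to probe (not used by the logic below)
    timeout_s: float = 5.0         # how long to wait for a response
    healthy_threshold: int = 2     # consecutive successes needed to become healthy
    unhealthy_threshold: int = 3   # consecutive failures needed to become unhealthy


def probe_succeeded(response_time_s, config: HealthCheckConfig) -> bool:
    """No response (None) or a response slower than the timeout is a failure."""
    return response_time_s is not None and response_time_s <= config.timeout_s


class HealthTracker:
    """Tracks one instance's health from a stream of probe results."""

    def __init__(self, config: HealthCheckConfig, healthy: bool = True):
        self.config = config
        self.healthy = healthy
        self._streak = 0  # consecutive results that contradict the current state

    def record(self, succeeded: bool) -> bool:
        if succeeded == self.healthy:
            # The probe agrees with the current state: reset the streak.
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = (self.config.unhealthy_threshold if self.healthy
                  else self.config.healthy_threshold)
        if self._streak >= needed:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy
```

With these example values, an instance is marked unhealthy only after three failed probes in a row, and becomes healthy again after two successes in a row.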

Instances that crash, or that are created or deleted outside of the group's own commands, are recovered automatically

Autoscaling is triggered based on group functional targets (CPU usage, load balancing, total capacity, budget, ...)

It's a group of identical (VM) instances generated from a template and managed as a whole

Connecting networks to Google VPC

Peering

It's not covered by Google Service Level Agreement (SLA)

gives direct access to Google Network

Carrier Peering

Using a third-party service provider, the customer network can be peered to those Google products that are exposed through a public IP

Direct peering

Placing a router in the same data center as a Google point of presence

Interconnect

Partner Interconnect

Useful if

The data connection needs are lower than 10 Gbps

the data center cannot be reached by Dedicated Interconnect

Connectivity between an on-premises network and a VPC network through a supported service provider.

Dedicated Interconnect

If the connection topologies meet G's specifications, SLA is covered up to 99.99%

Highest uptimes possible

1 or more direct, private connections to Google

IPsec VPN protocol

con

bandwidth reliability

Security concerns

pro

New subnet will be automatically added to the connection

A "secure Internet Protocol" generated to create a tunnel connection

Terminologies

Connections

Layer 3 Vs Layer 2

Layer 3 connections provide access to G Suite services, YouTube, and Google Cloud API's using public IP addresses

Layer 2 connections use a VLAN that pipes directly into your GCP environment providing connectivity to internal IP addresses

Shared Vs Dedicated

Dedicated connections provide a direct connection to Google's network

Shared connections provide a connection to Google's network through a partner

Labels

Useful for

Further specification

Billing indexing

Inventory purposes

Composed of

A value ('HomeVM','Amagis','testing','ThomasEdison','ABC01234'...)

A key ('name', 'company', 'aim','contact','cost center'...)

String attributes attachable to a resource (up to 64 per resource)
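
A minimal sketch of labels as key/value string pairs, using the example keys and values from the notes above. The helper names are hypothetical, and only the rules stated here (string attributes, at most 64 per resource) are enforced. The real service applies further naming rules that this sketch does not model.

```python
MAX_LABELS_PER_RESOURCE = 64  # limit quoted in the notes above


def attach_labels(existing: dict, new: dict) -> dict:
    """Return the merged label set, rejecting non-string entries and overflow."""
    merged = {**existing, **new}
    for key, value in merged.items():
        if not isinstance(key, str) or not isinstance(value, str):
            raise TypeError("label keys and values must be strings")
    if len(merged) > MAX_LABELS_PER_RESOURCE:
        raise ValueError(f"at most {MAX_LABELS_PER_RESOURCE} labels per resource")
    return merged


def filter_by_label(resources: dict, key: str, value: str) -> list:
    """Names of resources carrying the given label, e.g. for inventory or billing."""
    return sorted(name for name, labels in resources.items()
                  if labels.get(key) == value)
```

Filtering an inventory by a label such as 'aim' = 'testing' is the kind of lookup that makes labels useful for billing breakdowns and inventory purposes.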

APIs

Subsequent modifications to the app implementation can be made without altering the interface.

In case of need, the versioning of the interface allows for legacy compatibility

It's a user-friendly interface that lets the user program against the application while hiding unnecessary complexity

Containers

The host machine is called a "node", since it is one of many hosting a replica (shared workload) or a copy (same workload, for redundancy)

From the host, it only requires a container-supporting OS and a container runtime service

They combine the workload scalability of PaaS with the flexibility (HW adaptability and HW adjustability) of IaaS

It can be deployed with just the required system specification (adaptability) and be redefined on demand (adjustability). It is coordinated within the cloud with just the number of replicas needed to share the actual workload (scalability), and these can be started or stopped much more quickly than an ordinary VM.

It's a dedicated portion of both the OS kernel and user memory, which can therefore become operative quickly, as a sealed environment, through a few system calls

From the host's point of view it resembles a service; from the guest software's point of view it is a VM. Moreover, from the point of view of the whole application (that they are built for), each guest software is a microservice.

Offerings

SaaS: Software as a Service

Not installed on local PC but run on the cloud

PaaS: Platform as a Service

What they use

Bind code to libraries that provide access to the infrastructure application needs.

More resources to be focused on application logic

IaaS: Infrastructure as a Service

Users pay for:

What they allocate in advance

Similar to Data Centers

Provide

Network Capabilities

Storage

Raw Compute

Managed Infrastructure - Managed Services

Allows

Companies to concentrate on their goal rather than on maintaining technical infrastructure

Serverless

Eliminate need for infrastructure management

Allows developers to concentrate on code

Interaction with Google Cloud

Cloud Console Mobile App

Alerts and incident management

Generate custom graphs showing metrics

Control billing and set up alerts

Administer applications deployed on App Engine

Start and Stop Cloud SQL instances

Start, stop, and use SSH to connect to Compute Engine instances and see Logs

Application Programming Interfaces (APIs)

Plenty of libraries for coding against the APIs in different languages

Within the Google Cloud Console there is the Google APIs Explorer

All Google Cloud services come with an API, so it is possible to write code to control them

Cloud SDK and Cloud Shell

Cloud Shell

Includes Cloud SDK and other utilities fully available, updated and authenticated.

It's actually a Debian-based VM with persistent 5 GB home dir

Provides command-line access to cloud resources from the browser

Cloud SDK

Set of tools useful to manage resources and applications

....

bq - a command line tool for BigQuery

gsutil - provides access to Cloud Storage from cmd line

gcloud tool - the main command line interface

Google Cloud Console

The web-based Graphical User Interface

provides

SSH connection via browser

Find resources, check status, manage them, set budgets

Easily deploy, scale, and diagnose production issues

Functional Structure

The resource hierarchy relates with the policies

IAM Best Practices

When possible, use the Identity-Aware Proxy (IAP) tool

On Service Accounts

Establish key rotation policies and methods

And audit keys with the "serviceAccount.keys.list()" method

When creating one, use a clear, explanatory name based on its purpose

Even better if a naming convention is established

Be careful when granting the "serviceAccountUser" role, since the grantee will receive all the permissions granted to the service account

On Policies

When possible, grant roles to groups instead of individuals, then...

Control the ownership of the groups used in IAM policies

Audit membership of groups used in policies

Audit members of groups used in policies

Audit policies in Cloud Audit Logs: SetIamPolicy

Use the "principle of least privilege" when assigning roles

Check the policy granted on each resource and make sure to understand its inheritance

Use projects to group resources that share the same trust boundary

Policies are inherited downward

Can be defined at every level above the resource level and, for some services, also at the resource level

4-level resource hierarchy

4. Resources

Each belongs to exactly one project

3. Projects

Can have several owners and users

If it's under an organization node, its identity will automatically be an owner

Are the basis for the use of Google services

2. Folders

Require an Organization node

1. Organization

Either your organization has a Google Workspace domain, or you will need to create an identity by Cloud Identity
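
Downward policy inheritance across the four levels can be sketched as a walk from a node up to the organization, merging role bindings along the way. This is a simplified model under stated assumptions: the `Node` class and the member strings are hypothetical, and the sketch only models additive inheritance, which is why a role granted high in the hierarchy cannot be taken away lower down.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """An organization, folder, project, or resource in the hierarchy."""
    name: str
    parent: Optional["Node"] = None
    bindings: dict = field(default_factory=dict)  # role -> set of members

    def effective_bindings(self) -> dict:
        """Union of the bindings set on this node and on all of its ancestors."""
        merged = {}
        node = self
        while node is not None:
            for role, members in node.bindings.items():
                merged.setdefault(role, set()).update(members)
            node = node.parent
        return merged


def has_role(node: Node, member: str, role: str) -> bool:
    return member in node.effective_bindings().get(role, set())
```

For instance, a viewer role granted to a group at the organization level is effective on every project and resource underneath, while a role granted on one project does not leak into sibling projects.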

Monitoring

Integrated Observability Tools

Manage Incidents

With

SLO compliance

Error Reporting - to assist developers' work

Alerts - Automatically from data signals, and possibly directly to personnel in key roles

Visualize and Analyze

by using

Profiler - from running Apps

Snapshot Debug - from running Apps

Health Checks - from Services (uptime and latency when facing external sites)

Service Monitoring - from Services (compliance with SLOs) and Error alerts

Logs Explorer - from Logs

Metrics Explorer - from Signal Data

Dashboards - from Signal Data

Capturing signals

Are divided in

Trace - for Apps

Logs - for Apps, Services, Platform

Categorized in

Service Logs: created by developers deploying code to Google Cloud.

Network Logs: Network and Security operations

NAT Gateway - captures information on NAT network connections and errors

Firewall Rules - allow you to audit, verify, and analyze the effects of your firewall rules

VPC flows - record samples of VPC network flows and can be used for network monitoring, forensics, real-time security analysis, and expense optimization

Agent Logs: generated by a Google agent installed on AWS or Google Cloud VM instances to ingest the logs these generate

Cloud Audit Logs: helps answer the question "Who did what, where, and when?"

Access Transparency - capture the actions Google personnel take when accessing your content

System events - non-human Google Cloud administrative actions that change the configuration of resources

Data access - tracks calls that read the configuration or metadata of resources and user-driven calls that create, modify, or read user-provided resource data

Admin activity - tracks configuration changes

Metrics - for Apps, Services, Platform, Microservices

Integrated in all G's Tools from the hardware layer up

Targets set for the signals

SLAs - Service Level Agreements

Set alert thresholds considerably higher than what is defined as the minimum

Include compensation for paying customers if not respected

The minimum levels of service promised AND what happens when they are not respected

Commitments made to the client that systems and applications will have only a certain amount of “down time”

SLOs - Service Level Objectives

To be something short of 100%, like 99.9% ("3 nines")

Have concrete, well-documented consequences in case of failure to meet the objectives

be S.M.A.R.T.

Time-bound (or it becomes ephemeral)

Relevant (to the defined goals)

Achievable (realistic under the actual conditions)

Measurable

Specific (as in not subjective)

The target value for a monitored metric

SLIs - Service Level Indicators

It's suggested

as the ratio: # good events / # all valid events.

Should

Have a close linear relationship with the users' experience of that reliability

Are

Selected monitoring metrics that measure one aspect of a service's reliability
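
The suggested SLI ratio and its comparison against an SLO target can be written out directly. This is a minimal sketch: the function names are hypothetical, and the default target simply reuses the 99.9% ("3 nines") figure mentioned for SLOs.

```python
def sli(good_events: int, valid_events: int) -> float:
    """SLI as the ratio of good events to all valid events."""
    if valid_events <= 0:
        raise ValueError("need at least one valid event")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good events must be between 0 and valid events")
    return good_events / valid_events


def meets_slo(good_events: int, valid_events: int, slo_target: float = 0.999) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli(good_events, valid_events) >= slo_target
```

With 10,000 valid requests, 9,990 good ones just meet a 99.9% objective, and one more failure breaks it.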

The 4 golden signals

Errors

# of dropped connections

Servers that fail liveness checks

# of stack traces

# of exceptions

# of failed requests

# of 400/500 HTTP codes

Wrong answers or incorrect content

Often arise when

a flaw, failure, or fault in a computer program or system causes it to produce incorrect or unexpected results, or behave in unintended ways

The moment to send out an alert

Service level objective violations

Configuration or capacity issues

They are important because

They may indicate that something is failing

They are

Events that measure system failures or other issues

Saturation

# of users on the system

# of available connections

Memory quota

Disk quota

% CPU utilization

% disk utilization

% cache utilization

% memory utilization

% thread pool utilization

Degrading performance as capacity is reached

It's an indicator of how full the service is

The residual availability of the most constrained resources

Traffic

To note

Is often a subjective measure depending on the application type

# of active connections

# of read ops

# of write ops

# of active requests

# of retrievals per second

# of transactions per second

# of concurrent sessions

Network I/O

# of requests for static vs. dynamic content

# of HTTP requests per second

User appreciation

Infrastructure spending

Capacity planning

It's important because

It’s an indicator of current system demand

how many requests are reaching the system

Latency

Some metrics

...

Time to complete data return

Time to first response

Transaction duration

Service response time

Query duration

# of requests waiting for a thread

Page load latency

Can be related to

Measurement of system improvement

Capacity demand

Emerging issues

It's important because

It directly affects the user experience

It measures

How long it takes a specific task to return a result

We need

Monitoring tools that help provide data crucial to debugging application functional and performance issues

Automated alerts. Even better option is to construct automated systems to handle as many alerts as possible so humans only have to look at the most critical issues

Dashboards to provide business intelligence so our DevOps personnel have the data they need

Our products to improve continually, and we need data we can receive from monitoring for this

Transparency is key to building trust

Is defined as: "Collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times, and server lifetimes."

It's the foundation of production reliability - Site Reliability Engineering (SRE)

Security

Google's operational security layer

Stringent software development practices

Vulnerability Rewards Programs

provides libraries that prevent developers from introducing certain classes of security bugs

two-party review of new code

central source control

Employee Universal Second Factor use

Reducing insider risk

Aggressively limits and actively monitors the activities of employees

Intrusion Detection

Rules and machine intelligence

Internet Communication layer

Denial of Service Protection

Multi-tier and multi-layer protections

Sheer scale of the infrastructure is a first layer of protection

Google Front End ("GFE")

Protection against Denial of Service attacks

Best practices are always in place

Every TLS connection is terminated using a public-private key pair and an X.509 certificate

Storage service layer

Encryption at rest

Data in physical storage is encrypted with centrally managed keys and typically accessed through storage services

Google’s central identity service

On user demand, 2nd factor identification

Request of further info, based on risk factors

Service deployment layer

Google’s services communicate with each other using RPC (remote procedure call) calls

Within Data Center: Hardware cryptographic accelerators - Ongoing

Between Data Centers: Encryption of all inter-service RPC communication - Already in place

HW 3 layers

Premises security

Third-party Data Center

Limited access

Physical protections

Secure boot stack

base operating system image

kernel

bootloader

Cryptographic signatures over the BIOS

Infrastructure is custom designed

Security chips

Google Global Network Infrastructure

Geographical structure

Allows for

Reduce distance between endpoints

Reduce redundancy by selecting the deployment areas

Protection from localized event

Low Latency

Measures the time a packet takes to reach its destination

Durability

Availability

5 Locations

Australia

Asia

North America

South America

Europe

Designed for

Averaging more than 100 content caching nodes worldwide

High demand content is cached for quicker access

Lowest possible latencies

Highest possible throughput

Outsider

Open Source systems

Tensorflow

Open source software library for machine learning

Components

Pricing Structure

Monitoring and estimate calculation tools

Reports

Quotas

Designed to prevent over-usage due to malicious attacks

Allocation quotas: resource limits

Rate quotas: reset periodically
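
The difference between the two quota types can be sketched as follows: an allocation quota caps how much is held at once until something is released, while a rate quota caps usage per period and resets when a new period starts. The classes and the fixed-window reset are hypothetical simplifications for illustration, not how Google Cloud implements quotas.

```python
class AllocationQuota:
    """Caps how many resources can be held at the same time."""

    def __init__(self, limit: int):
        self.limit = limit
        self.in_use = 0

    def acquire(self, count: int = 1) -> bool:
        if self.in_use + count > self.limit:
            return False
        self.in_use += count
        return True

    def release(self, count: int = 1) -> None:
        self.in_use = max(0, self.in_use - count)


class RateQuota:
    """Caps the number of calls per period; the counter resets each period."""

    def __init__(self, limit: int, period_s: float):
        self.limit = limit
        self.period_s = period_s
        self._window = None  # index of the period currently being counted
        self._used = 0

    def allow(self, now_s: float) -> bool:
        window = int(now_s // self.period_s)
        if window != self._window:
            # A new period has started, so the usage counter resets.
            self._window = window
            self._used = 0
        if self._used < self.limit:
            self._used += 1
            return True
        return False
```

Passing the current time in explicitly keeps the sketch deterministic. A caller would normally supply a clock reading such as `time.monotonic()`.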

Budget can be defined at billing account level or at project level

Notification alerts

Deliver per-second billing for its infrastructure-as-a-service

Online pricing calculator for cost estimates

Just some services, not all of them???

App Engine flexible environment VMs

Dataproc

Kubernetes Engine

Compute Engine

Customizable VMs to tailor resources and pricing to workloads

discount applied if used for more than 25% of a month
