Wednesday, May 26, 2021

DP-900 Notes

 What is data?

Data is a collection of facts such as numbers, descriptions, and observations used in decision making. You can classify data as structured, semi-structured, or unstructured.

Relational database

Structured data is typically tabular data that is represented by rows and columns in a database. 
The main characteristics of a relational database are:
  • An entity is described as a thing about which information needs to be known or held. 
  • All data is tabular. Entities are modeled as tables, each instance of an entity is a row in the table, and each property is defined as a column.
  • All rows in the same table have the same set of columns.
  • A table can contain any number of rows.
  • A primary key uniquely identifies each row in a table. No two rows can share the same primary key.
  • A foreign key references rows in another, related table. For each value in the foreign key column, there should be a row with the same value in the corresponding primary key column in the other table.
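The characteristics above can be sketched with a small relational example. This uses Python's built-in sqlite3 module as a stand-in for any relational engine; the table and column names are hypothetical:

```python
import sqlite3

# In-memory database as a minimal stand-in for a relational engine.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Each entity is a table, each property is a column, each instance is a row.
conn.execute("""CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,   -- uniquely identifies each row
    name        TEXT NOT NULL)""")
conn.execute("""CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),  -- foreign key
    total       REAL)""")

conn.execute("INSERT INTO customer VALUES (1, 'Contoso')")
conn.execute("INSERT INTO orders VALUES (100, 1, 49.99)")

# A foreign key value must match a primary key in the related table.
try:
    conn.execute("INSERT INTO orders VALUES (101, 999, 10.0)")  # no customer 999
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The rejected insert shows the foreign-key rule from the list above: every value in the foreign key column must have a matching row in the related table's primary key column.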

non-relational system

In a non-relational system, you store the information for entities in collections or containers rather than relational tables. Two entities in the same collection can have different sets of fields rather than the regular set of columns found in a relational table. The lack of a fixed schema means that each entity must be self-describing. Often this is achieved by labeling each field with the name of the data that it represents.

Semi-structured data is information that doesn't reside in a relational database but still has some structure to it. Examples include documents held in JavaScript Object Notation (JSON) format, key-value stores and graph databases.
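A quick JSON sketch of such self-describing, schema-free entities (the field names are hypothetical):

```python
import json

# Two entities in the same collection; each is self-describing and the
# fields can differ (no fixed schema, unlike rows in a relational table).
customers = [
    {"id": 1, "name": "Contoso", "email": "info@contoso.example"},
    {"id": 2, "name": "Fabrikam", "phone": "555-0100", "vip": True},
]

doc = json.dumps(customers[1])
print(doc)  # every field is labeled with the name of the data it holds
restored = json.loads(doc)
print(restored["vip"])  # True
```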

Azure Cosmos DB supports indexing even when the structure of the indexed data can vary from record to record.

Unstructured data, such as audio and video files and binary data files, can be stored in Azure Blob storage (blob is an acronym for Binary Large Object).

key-value store - NoSQL
Azure Table storage is an example of a key-value store. Cosmos DB also implements a key-value store using the Table API.

Document database - JSON, Azure Cosmos DB

column family database
Apache Cassandra is an example. Azure Cosmos DB supports the column-family approach through the Cassandra API.

graph database
A graph database stores two types of information: nodes, which you can think of as instances of entities, and edges, which specify the relationships between nodes. Azure Cosmos DB supports graph databases using the Gremlin API. Gremlin is a standard language for creating and querying graphs.
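A minimal sketch of nodes and edges (the entities and relationship names are hypothetical):

```python
# Nodes are instances of entities; edges are named relationships between nodes.
nodes = {"alice": {"type": "person"}, "contoso": {"type": "company"}}
edges = [("alice", "worksFor", "contoso")]

# Query: who does alice work for? Walk the edges from the "alice" node.
employers = [dst for src, rel, dst in edges if src == "alice" and rel == "worksFor"]
print(employers)  # ['contoso']
```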

levels of access

Read-only access means the users can read data but can't modify any existing data or create new data.
Read/write access gives users the ability to view and modify existing data.
Owner privilege gives full access to the data, including managing security, such as adding new users and removing access from existing users.

analytical systems

An analytical system is designed to support business users who need to query data and gain a big-picture view of the information held in a database. Analytical systems are concerned with capturing raw data and using it to generate insights. A typical analytical workflow has four stages: data ingestion, data transformation, data querying, and data visualization.

Processing data as it arrives is called streaming. Buffering and processing the data in groups is called batch processing.
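The distinction can be sketched in a few lines (the doubling step stands in for any per-item processing):

```python
# Streaming: process each item immediately as it arrives.
def stream_process(item):
    return item * 2

# Batch processing: buffer the items and process them as a group.
buffer = []

def batch_process(item, batch_size=3):
    buffer.append(item)
    if len(buffer) >= batch_size:
        results = [i * 2 for i in buffer]
        buffer.clear()
        return results
    return None  # still buffering; nothing emitted yet

for x in [1, 2, 3]:
    print("stream:", stream_process(x))  # a result per item, right away

# The batch result only appears once the group is full.
print("batch:", batch_process(1), batch_process(2), batch_process(3))
```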

Data processing can be complex, and may involve automated scripts, and tools such as Azure Databricks, Azure Functions, and Azure Cognitive Services to examine and reformat the data, and generate models. A data analyst could use machine learning to help determine future trends based on these models.

Generally speaking, data analytics comes in four types (Figure 1):

Descriptive, to answer the question: What’s happening?
Diagnostic, to answer the question: Why is it happening?
Predictive, to answer the question: What will happen?
Prescriptive, to answer the question: What actions should we take?



transaction processing systems

Transactional systems are often high-volume and support fast processing. While normalization enables fast throughput for transactions, it can make querying more complex.
A transactional database must adhere to the ACID (Atomicity, Consistency, Isolation, Durability) properties to ensure that the database remains consistent while processing transactions.
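Atomicity (the A in ACID) can be demonstrated with sqlite3 as a stand-in transactional engine; the account table and amounts are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO account VALUES (1, 100.0), (2, 50.0)")
conn.commit()

# Atomicity: a transfer either fully completes or leaves no trace.
try:
    with conn:  # the context manager commits on success, rolls back on error
        conn.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")
        raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    pass

balance = conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]
print(balance)  # 100.0 -- the partial update was rolled back
```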






If you are storing data in Azure SQL Database, you can use SQL Server Integration Services. Integration Services can extract and transform data from a wide variety of sources such as XML data files, flat files, and relational data sources, and then load the data into one or more destinations.

Azure Data Factory is a cloud-based data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale. Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores. You can build complex ETL processes that transform data visually with data flows, or by using compute services such as Azure HDInsight Hadoop, Azure Databricks, and Azure SQL Database.

visualization options to represent data

Bar and column charts
Line charts: Line charts emphasize the overall shape of an entire series of values, usually over time.
Matrix: A matrix visual is a tabular structure that summarizes data. 
Key influencers: A key influencer chart displays the major contributors to a selected result or value. 
Treemap: Treemaps are charts of colored rectangles, with size representing the relative value of each item. 
Scatter: A scatter chart shows the relationship between two numerical values. A bubble chart is a scatter chart that replaces data points with bubbles, with the bubble size representing an additional third data dimension.
Filled map




Azure SQL Database - Single Database, Elastic Pool, and Managed Instance
  • Low maintenance
  • Scalability
  • High availability guarantees
  • Advanced threat protection provides advanced security capabilities.
  • Auditing tracks database events and writes them to an audit log in your Azure storage account.
  • SQL Database helps secure your data by providing encryption.

Neither Single Database nor Elastic Pool supports linked servers.






Virtual local area networks (VLANs) and access control lists (ACLs) are used to restrict network communications by source and destination networks, protocols, and port numbers.

Items that implement network-based ACLs include routers and load balancers. You control traffic flow through these items by defining firewall rules.

Clients connect to a gateway that has a public IP address and listens on port 1433.
Connections to your Azure Database for PostgreSQL server communicate over port 5432. 
Connections to your Azure Database for MySQL server communicate over port 3306. 

You can use any of these tools to query data held in Azure SQL Database:
  • The query editor in the Azure portal
  • The sqlcmd utility from the command line or the Azure Cloud Shell
  • SQL Server Management Studio
  • Azure Data Studio
  • SQL Server Data Tools

non-relational data in Azure

Azure Table Storage implements the NoSQL key-value model. In this model, the data for an item is stored as a set of fields, and the item is identified by a unique key. The key in an Azure Table Storage table comprises two elements: the partition key that identifies the partition containing the row, and a row key that is unique to each row in the same partition. Items in the same partition are stored in row key order. 
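The PartitionKey/RowKey scheme can be modeled with an in-memory sketch (the keys and values here are hypothetical):

```python
# Hypothetical in-memory model of the Azure Table Storage key scheme:
# (PartitionKey, RowKey) together uniquely identify an entity.
table = {}

def upsert(partition_key, row_key, entity):
    table.setdefault(partition_key, {})[row_key] = entity

def point_query(partition_key, row_key):
    # Fastest access path: both parts of the key are known.
    return table[partition_key][row_key]

upsert("UK", "001", {"name": "Alice"})
upsert("UK", "003", {"name": "Carol"})
upsert("US", "002", {"name": "Bob"})

print(point_query("UK", "003"))  # {'name': 'Carol'}
print(sorted(table["UK"]))       # within a partition, rows are kept in row-key order: ['001', '003']
```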

Azure Blob storage is a service that enables you to store massive amounts of unstructured data, or blobs, in the cloud. 
  • Block blobs. A block blob is handled as a set of blocks. Each block can vary in size, up to 100 MB. A block blob can contain up to 50,000 blocks, giving a maximum size of over 4.7 TB. The block is the smallest amount of data that can be read or written as an individual unit. Block blobs are best used to store discrete, large, binary objects that change infrequently.

  • Page blobs. A page blob is organized as a collection of fixed size 512-byte pages. A page blob is optimized to support random read and write operations; you can fetch and store data for a single page if necessary. A page blob can hold up to 8 TB of data. Azure uses page blobs to implement virtual disk storage for virtual machines.

  • Append blobs. An append blob is a block blob optimized to support append operations. You can only add blocks to the end of an append blob; updating or deleting existing blocks isn't supported. Each block can vary in size, up to 4 MB. The maximum size of an append blob is just over 195 GB.
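The block-blob maximum quoted above follows from the per-block and block-count limits; a quick arithmetic check:

```python
# Block blob maximum size: 50,000 blocks x 100 MB per block.
blocks = 50_000
block_mb = 100
total_tb = blocks * block_mb / 1024 / 1024  # MB -> GB -> TB
print(round(total_tb, 2))  # 4.77 -- "a maximum size of over 4.7 TB"
```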

Azure File Storage enables you to create file shares in the cloud, and access these file shares from anywhere with an internet connection. Azure File Storage exposes file shares using the Server Message Block 3.0 (SMB) protocol. 

Azure Cosmos DB is a multi-model NoSQL database management system. Cosmos DB manages data as a partitioned set of documents. A document is a collection of fields, identified by a key. The fields in each document can vary, and a field can contain child documents. Many document databases use JSON (JavaScript Object Notation) to represent the document structure. A document can hold up to 2 MB of data, including small binary objects. If you need to store larger blobs as part of a document, use Azure Blob storage, and add a reference to the blob in the document.  Documents in a Cosmos DB database are organized into containers. The APIs that Cosmos DB currently supports include:
  • SQL API
  • Table API - Key/Value data
  • MongoDB API - JSON documents
  • Cassandra API - column oriented
  • Gremlin API - Graph data

 

storage service comparison

Azure Table Storage
  • Use case: NoSQL key-value model.
  • Size: up to 500 TB, limited by the storage account. An individual entity can be up to 1 MB, with a maximum of 255 properties; the PartitionKey/RowKey strings can each be up to 1 KB in size.
  • Replication: replicated three times within an Azure region; you can also create tables in geo-redundant storage.
  • Encryption at rest: supported.

Azure Blob storage
  • Use case: large binary data, such as video and audio.
  • Size: an Azure storage account can hold up to 5 PB of data. Block blobs: 4.7 TB (100 MB per block). Page blobs: 8 TB (512-byte pages). Append blobs: 195 GB (each block up to 4 MB).
  • Access tiers: Hot (the default), Cool, and Archive.
  • Replication: replicated locally within a region, but can also be geo-replicated to a second region.
  • Encryption at rest: enabled by default.

Azure File storage
  • Use case: file shares.
  • Size: 100 TB in a single storage account; a single file can be up to 1 TB.
  • Performance tiers: Standard (HDD) and Premium (SSD). Azure aims to provide up to 300 MB/second of throughput for a single Standard file share.
  • Replication: replicated within a single region.
  • Encryption in transit: can be enabled.
  • Tools: the AzCopy utility and the Azure File Sync service.

Azure Cosmos DB
  • Use case: documents.
  • Size: Cosmos DB automatically allocates space in a container for your partitions, and each partition can grow up to 10 GB in size.
  • Availability: 99.99%.
  • Performance: less than 10-ms latencies for both reads (indexed) and writes at the 99th percentile; one RU per second (RU/s) will support an application that reads a single 1-KB document each second.

Data Lake store
  • Use case: large data for analytics.


Provisioning is the act of running a series of tasks that a service provider, such as Azure Cosmos DB, performs to create and configure a service. 
  • The Azure portal
  • The Azure command-line interface (CLI)
  • Azure PowerShell.
  • Azure Resource Manager templates.

Provision Azure Cosmos DB

A throughput of one RU per second (RU/s) will support an application that reads a single 1-KB document each second. The minimum throughput you can allocate to a database or container is 400 RU/s. 
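The RU rule of thumb above can be turned into a rough capacity estimate. This sketch is illustrative only: the 5x write multiplier is an assumption for the example, not a documented Cosmos DB cost.

```python
# Rough RU/s estimate: one RU/s covers one 1-KB point read per second.
# Writes and larger documents cost more; the write multiplier here is
# an illustrative assumption, not a documented Cosmos DB figure.
def estimate_rus(reads_per_sec, writes_per_sec, doc_kb=1, write_multiplier=5):
    rus = reads_per_sec * doc_kb + writes_per_sec * doc_kb * write_multiplier
    return max(rus, 400)  # 400 RU/s is the minimum you can allocate

print(estimate_rus(100, 20))   # 100 + 100 = 200, floored to the 400 minimum
print(estimate_rus(500, 100))  # 500 + 500 = 1000
```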

Azure Storage account

Standard storage accounts are based on hard disks. 
Premium storage uses solid-state drives

Account kind
General-purpose v2. You can use this type of storage account for blobs, files, queues, and tables. It's recommended for most scenarios that require Azure Storage. If you want to provision Azure Data Lake Storage, you should specify this account type.

General-purpose v1. This is a legacy account type for blobs, files, queues, and tables. Use general-purpose v2 accounts when possible.

BlockBlobStorage. This type of storage account is only available for premium accounts. You use this account type for block blobs and append blobs. It's recommended for scenarios with high transaction rates, or that use smaller objects, or require consistently low storage latency.

FileStorage. This type is also only available for premium accounts. You use it to create files-only storage accounts with premium performance characteristics. It's recommended for enterprise or high-performance scale applications. Use this type if you're creating an account to support File Storage.

BlobStorage. This is another legacy account type that can only hold blobs. Use general-purpose v2 accounts instead, when possible. You can use this account type for Azure Data Lake storage, but the General-purpose v2 account type is preferable.

To restrict connectivity to a service, use its Networking page and choose Selected networks. Three further sections will appear, labeled Virtual Network, Firewall, and Exceptions.

Azure Private Endpoint is a network interface that connects you privately and securely to a service powered by Azure Private Link. Private Endpoint uses a private IP address from your VNet, effectively bringing the service into your VNet. 

A role assignment consists of three elements: a security principal, a role definition, and a scope.

  • A security principal is an object that represents a user, group, service, or managed identity that is requesting access to Azure resources.

  • A role definition, often abbreviated to role, is a collection of permissions. A role definition lists the operations that can be performed, such as read, write, and delete. Roles can be given high-level names, like owner, or specific names, like virtual machine reader. Azure includes several built-in roles that you can use, including:

    • Owner - Has full access to all resources including the right to delegate access to others.

    • Contributor - Can create and manage all types of Azure resources but can't grant access to others.

    • Reader- Can view existing Azure resources.

    • User Access Administrator - Lets you manage user access to Azure resources.

    You can also create your own custom roles. For detailed information, see Create or update Azure custom roles using the Azure portal on the Microsoft website.

  • A scope lists the set of resources that the access applies to. When you assign a role, you can further limit the actions allowed by defining a scope. This is helpful if, for example, you want to make someone a Website Contributor, but only for one resource group.
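The three elements of a role assignment can be modeled as a quick sketch; the principal names and scope paths are hypothetical, and the permission sets are simplified stand-ins for the real role definitions:

```python
# Hypothetical model of an Azure role assignment: a security principal,
# a role definition (a set of permitted operations), and a scope.
role_definitions = {
    "Reader": {"read"},
    "Contributor": {"read", "write", "delete"},
    "Owner": {"read", "write", "delete", "grant"},
}

# Each assignment is (principal, role, scope).
assignments = [("alice", "Contributor", "/rg-website")]

def is_allowed(principal, operation, resource):
    return any(
        p == principal
        and operation in role_definitions[role]
        and resource.startswith(scope)  # the scope limits where the role applies
        for p, role, scope in assignments
    )

print(is_allowed("alice", "write", "/rg-website/app1"))  # True
print(is_allowed("alice", "write", "/rg-other/app2"))    # False: outside the scope
print(is_allowed("alice", "grant", "/rg-website/app1"))  # False: Contributor can't grant access
```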


Cosmos DB consistency levels
  • Eventual
  • Consistent Prefix
  • Session
  • Bounded Staleness
  • Strong
Cosmos DB supports several NoSQL models, including document stores, graph databases, key-value stores, and column family databases.

Documents in a Cosmos DB database are organized into containers. The documents in a container are grouped together into partitions. A partition holds a set of documents that share a common partition key. 
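How a partition key groups documents can be sketched as follows (the field names and values are hypothetical):

```python
from collections import defaultdict

# Sketch of partitioning by partition key: documents that share a key
# land in the same partition, so related data stays together.
partitions = defaultdict(list)

def insert(doc, partition_key_field="city"):
    partitions[doc[partition_key_field]].append(doc)

insert({"id": "1", "city": "Seattle", "name": "Alice"})
insert({"id": "2", "city": "London", "name": "Bob"})
insert({"id": "3", "city": "Seattle", "name": "Carol"})

# All documents with partition key "Seattle" are in one partition.
print([d["id"] for d in partitions["Seattle"]])  # ['1', '3']
```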

Perform data operations in Cosmos DB

data warehousing

Azure provides a collection of services you can use to build a data warehouse solution, including Azure Data Factory, Azure Data Lake Storage, Azure Databricks, Azure Synapse Analytics, and Azure Analysis Services. You can use tools such as Power BI to analyze and visualize the data, generating reports, charts, and dashboards.

Azure Data Ingestion tools: Azure Data Factory, PolyBase, SQL Server Integration Services, and Azure Databricks.
Azure Data Factory uses a number of different resources: linked services, datasets, and pipelines. 

The most common options for processing data in Azure include Azure Databricks, Azure Data Factory, Azure Synapse Analytics, and Azure Data Lake.

Azure Synapse Analytics is a generalized analytics service. You can select between two technologies to process data:
  • Transact-SQL
  • Spark



Azure Databricks is an analytics platform optimized for the Microsoft Azure cloud services platform. The processing engine is provided by Apache Spark. You write Databricks applications using a Notebook. A notebook contains a series of steps (cells), each of which contains a block of code. 

Azure HDInsight is a managed analytics service in the cloud. It's based on Apache Hadoop. HDInsight uses a clustered model. HDInsight stores data using Azure Data Lake storage. You can use HDInsight to analyze data using frameworks such as Hadoop Map/Reduce, Apache Spark, Apache Hive, Apache Kafka, Apache Storm, R, and more.
  • Hadoop Map/Reduce and Apache Spark
  • Apache Hive provides interactive SQL-like facilities for querying, aggregating, and summarizing data. 
  • Apache Kafka is a clustered streaming service that can ingest data in real time. It's a highly scalable solution that offers publish and subscribe features.
  • Apache Storm is a scalable, fault tolerant platform for running real-time data processing applications. 
Azure Data Factory is a service that can ingest large amounts of raw, unorganized data from relational and non-relational systems, and convert this data into meaningful information. 

Azure Data Lake is a collection of analytics and storage services that you can combine to implement a big data solution. It comprises three main elements:
  • Data Lake Store
  • Data Lake Analytics - U-SQL
  • HDInsight