Professional-Data-Engineer | What Actual Professional-Data-Engineer Prep Is

Exam prep materials and sample questions for the Google Professional-Data-Engineer certification, with real success guaranteed using updated Professional-Data-Engineer PDF and VCE materials. Pass the Google Professional Data Engineer exam today!

Check Professional-Data-Engineer free dumps before getting the full version:


You use BigQuery as your centralized analytics platform. New data is loaded every day, and an ETL pipeline modifies the original data and prepares it for the final users. This ETL pipeline is regularly modified and can generate errors, but sometimes the errors are detected only after 2 weeks. You need to provide a method to recover from these errors, and your backups should be optimized for storage costs. How should you organize your data in BigQuery and store your backups?

  • A. Organize your data in a single table, export, and compress and store the BigQuery data in Cloud Storage.
  • B. Organize your data in separate tables for each month, and export, compress, and store the data in Cloud Storage.
  • C. Organize your data in separate tables for each month, and duplicate your data on a separate dataset in BigQuery.
  • D. Organize your data in separate tables for each month, and use snapshot decorators to restore the table to a time prior to the corruption.

Answer: B

BigQuery time travel and snapshot decorators reach back at most 7 days, so they cannot recover errors detected only after 2 weeks. Monthly tables exported, compressed, and stored in Cloud Storage can be restored selectively and keep backup storage costs low.


Your neural network model is taking days to train. You want to increase the training speed. What can you do?

  • A. Subsample your test dataset.
  • B. Subsample your training dataset.
  • C. Increase the number of input features to your model.
  • D. Increase the number of layers in your neural network.

Answer: B

Subsampling the training dataset reduces the work done in every training step. Adding input features or layers (C, D) increases computation and slows training, and the test dataset (A) is not used during training at all.
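Subsampling the training set is the direct way to cut per-epoch cost. A minimal, framework-free sketch (all names here are illustrative, not from any particular ML library):

```python
import random

def subsample(training_data, fraction, seed=0):
    """Return a random fraction of the training examples.

    Training on fewer examples shortens each epoch; validate that model
    accuracy stays acceptable before adopting the smaller set.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    k = max(1, int(len(training_data) * fraction))
    return rng.sample(training_data, k)

full_set = list(range(1_000_000))      # stand-in for (features, label) pairs
small_set = subsample(full_set, 0.1)   # train on 10% of the data
print(len(small_set))                  # 100000
```

A common practice is to tune hyperparameters on the subsample, then run a final training pass on the full dataset.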



You are designing storage for 20 TB of text files as part of deploying a data pipeline on Google Cloud. Your input data is in CSV format. You want to minimize the cost of querying aggregate values for multiple users who will query the data in Cloud Storage with multiple engines. Which storage service and schema design should you use?

  • A. Use Cloud Bigtable for storage. Install the HBase shell on a Compute Engine instance to query the Cloud Bigtable data.
  • B. Use Cloud Bigtable for storage. Link as permanent tables in BigQuery for query.
  • C. Use Cloud Storage for storage. Link as permanent tables in BigQuery for query.
  • D. Use Cloud Storage for storage. Link as temporary tables in BigQuery for query.

Answer: C

Cloud Storage keeps the CSV files cheaply available to multiple query engines, and permanent external tables in BigQuery avoid redefining the schema for every query.


You have some data, which is shown in the graphic below. The two dimensions are X and Y, and the shade of each dot represents what class it is. You want to classify this data accurately using a linear algorithm.
[Exhibit: scatter plot of the two classes in the X-Y plane]
To do this you need to add a synthetic feature. What should the value of that feature be?

  • A. X^2+Y^2
  • B. X^2
  • C. Y^2
  • D. cos(X)

Answer: A

X^2+Y^2 is the squared distance from the origin, so classes arranged in concentric rings become separable by a single linear threshold on the new feature.
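Adding X^2+Y^2 as a feature makes radially arranged classes linearly separable. A small illustration with made-up sample points (the coordinates below are hypothetical, not from the exhibit):

```python
def radial_feature(x, y):
    """Synthetic feature: squared distance from the origin."""
    return x * x + y * y

# Hypothetical points: inner class near the origin, outer class on a ring.
inner = [(0.1, 0.2), (-0.3, 0.1), (0.2, -0.2)]
outer = [(2.0, 0.1), (-1.5, 1.5), (0.3, -2.1)]

# In the new feature, one threshold (i.e. a linear boundary) separates them.
threshold = 1.0
assert all(radial_feature(x, y) < threshold for x, y in inner)
assert all(radial_feature(x, y) > threshold for x, y in outer)
```

In the original X-Y space no straight line can split a ring from its center, but the radial feature reduces the problem to a one-dimensional cut.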


You want to build a managed Hadoop system as your data lake. The data transformation process is composed of a series of Hadoop jobs executed in sequence. To accomplish the design of separating storage from compute, you decided to use the Cloud Storage connector to store all input data, output data, and intermediary data. However, you noticed that one Hadoop job runs very slowly with Cloud Dataproc, when compared with the on-premises bare-metal Hadoop environment (8-core nodes with 100-GB RAM). Analysis shows that this particular Hadoop job is disk I/O intensive. You want to resolve the issue. What should you do?

  • A. Allocate sufficient memory to the Hadoop cluster, so that the intermediary data of that particular Hadoop job can be held in memory
  • B. Allocate sufficient persistent disk space to the Hadoop cluster, and store the intermediate data of that particular Hadoop job on native HDFS
  • C. Allocate more CPU cores of the virtual machine instances of the Hadoop cluster so that the networking bandwidth for each instance can scale up
  • D. Allocate additional network interface card (NIC), and configure link aggregation in the operating system to use the combined throughput when working with Cloud Storage

Answer: B

The job is disk I/O bound, so provisioning enough persistent disk and keeping that job's intermediate data on native HDFS avoids the round trips to Cloud Storage that are slowing it down.


Which of these operations can you perform from the BigQuery Web UI?

  • A. Upload a file in SQL format.
  • B. Load data with nested and repeated fields.
  • C. Upload a 20 MB file.
  • D. Upload multiple files using a wildcard.

Answer: B

You can load data with nested and repeated fields using the Web UI. You cannot use the Web UI to:
- Upload a file greater than 10 MB in size
- Upload multiple files at the same time
- Upload a file in SQL format
All three of the above operations can be performed using the bq command-line tool.


You are running a pipeline in Cloud Dataflow that receives messages from a Cloud Pub/Sub topic and writes the results to a BigQuery dataset in the EU. Currently, your pipeline is located in europe-west4 and has a maximum of 3 workers, instance type n1-standard-1. You notice that during peak periods, your pipeline is struggling to process records in a timely fashion, when all 3 workers are at maximum CPU utilization. Which two actions can you take to increase performance of your pipeline? (Choose two.)

  • A. Increase the number of max workers.
  • B. Use a larger instance type for your Cloud Dataflow workers.
  • C. Change the zone of your Cloud Dataflow pipeline to run in us-central1.
  • D. Create a temporary table in Cloud Bigtable that will act as a buffer for new data. Create a new step in your pipeline to write to this table first, and then create a new pipeline to write from Cloud Bigtable to BigQuery.
  • E. Create a temporary table in Cloud Spanner that will act as a buffer for new data. Create a new step in your pipeline to write to this table first, and then create a new pipeline to write from Cloud Spanner to BigQuery.

Answer: AB

Raising the worker cap lets autoscaling add capacity, and a larger instance type raises per-worker throughput. Moving the pipeline out of the EU or inserting a buffering database does nothing to relieve CPU saturation.


Cloud Bigtable is Google's ______ Big Data database service.

  • A. Relational
  • B. mySQL
  • C. NoSQL
  • D. SQL Server

Answer: C

Cloud Bigtable is Google's NoSQL Big Data database service. It is the same database that Google uses for services, such as Search, Analytics, Maps, and Gmail.
It is used for requirements that are low latency and high throughput including Internet of Things (IoT), user analytics, and financial data analysis.


You are responsible for writing your company’s ETL pipelines to run on an Apache Hadoop cluster. The pipeline will require some checkpointing and splitting pipelines. Which method should you use to write the pipelines?

  • A. PigLatin using Pig
  • B. HiveQL using Hive
  • C. Java using MapReduce
  • D. Python using MapReduce

Answer: A

Pig Latin is designed for composing multi-stage data flows on Hadoop, with built-in support for splitting pipelines and checkpointing intermediate results, which raw MapReduce in Java or Python does not provide out of the box.


Your company’s customer and order databases are often under heavy load. This makes performing analytics against them difficult without harming operations. The databases are in a MySQL cluster, with nightly backups taken using mysqldump. You want to perform analytics with minimal impact on operations. What should you do?

  • A. Add a node to the MySQL cluster and build an OLAP cube there.
  • B. Use an ETL tool to load the data from MySQL into Google BigQuery.
  • C. Connect an on-premises Apache Hadoop cluster to MySQL and perform ETL.
  • D. Mount the backups to Google Cloud SQL, and then process the data using Google Cloud Dataproc.

Answer: B

Loading the data into BigQuery with an ETL tool moves analytics off the operational cluster entirely, so analytical queries no longer compete with transactional load on MySQL.


If you're running a performance test that depends upon Cloud Bigtable, all the choices except one below are recommended steps. Which is NOT a recommended step to follow?

  • A. Do not use a production instance.
  • B. Run your test for at least 10 minutes.
  • C. Before you test, run a heavy pre-test for several minutes.
  • D. Use at least 300 GB of data.

Answer: A

If you're running a performance test that depends upon Cloud Bigtable, be sure to follow these steps as you
plan and execute your test:
Use a production instance. A development instance will not give you an accurate sense of how a production instance performs under load.
Use at least 300 GB of data. Cloud Bigtable performs best with 1 TB or more of data. However, 300 GB of data is enough to provide reasonable results in a performance test on a 3-node cluster. On larger clusters, use 100 GB of data per node.
Before you test, run a heavy pre-test for several minutes. This step gives Cloud Bigtable a chance to balance data across your nodes based on the access patterns it observes.
Run your test for at least 10 minutes. This step lets Cloud Bigtable further optimize your data, and it helps ensure that you will test reads from disk as well as cached reads from memory.


As your organization expands its usage of GCP, many teams have started to create their own projects. Projects are further multiplied to accommodate different stages of deployments and target audiences. Each project requires unique access control configurations. The central IT team needs to have access to all projects. Furthermore, data from Cloud Storage buckets and BigQuery datasets must be shared for use in other projects in an ad hoc way. You want to simplify access control management by minimizing the number of policies. Which two steps should you take? Choose 2 answers.

  • A. Use Cloud Deployment Manager to automate access provision.
  • B. Introduce resource hierarchy to leverage access control policy inheritance.
  • C. Create distinct groups for various teams, and specify groups in Cloud IAM policies.
  • D. Only use service accounts when sharing data for Cloud Storage buckets and BigQuery datasets.
  • E. For each Cloud Storage bucket or BigQuery dataset, decide which projects need access. Find all the active members who have access to these projects, and create a Cloud IAM policy to grant access to all these users.

Answer: BC

Policy inheritance through the resource hierarchy and group-based IAM bindings both cut the number of policies to manage; per-user, per-resource policies (E) multiply them.


How would you query specific partitions in a BigQuery table?

  • A. Use the DAY column in the WHERE clause
  • B. Use the EXTRACT(DAY) clause
  • C. Use the _PARTITIONTIME pseudo-column in the WHERE clause
  • D. Use DATE BETWEEN in the WHERE clause

Answer: C

Partitioned tables include a pseudo column named _PARTITIONTIME that contains a date-based timestamp for data loaded into the table. To limit a query to particular partitions (such as Jan 1st and 2nd of 2017), use a clause similar to this: WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2017-01-01') AND TIMESTAMP('2017-01-02')


You operate a logistics company, and you want to improve event delivery reliability for vehicle-based sensors. You operate small data centers around the world to capture these events, but leased lines that provide connectivity from your event collection infrastructure to your event processing infrastructure are unreliable, with unpredictable latency. You want to address this issue in the most cost-effective way. What should you do?

  • A. Deploy small Kafka clusters in your data centers to buffer events.
  • B. Have the data acquisition devices publish data to Cloud Pub/Sub.
  • C. Establish a Cloud Interconnect between all remote data centers and Google.
  • D. Write a Cloud Dataflow pipeline that aggregates all data in session windows.

Answer: B

Having the devices publish directly to Cloud Pub/Sub bypasses the unreliable leased lines entirely and requires no new infrastructure, making it the most cost-effective option; Kafka clusters and Cloud Interconnect both add cost and operational burden.


Which of these is NOT a way to customize the software on Dataproc cluster instances?

  • A. Set initialization actions
  • B. Modify configuration files using cluster properties
  • C. Configure the cluster using Cloud Deployment Manager
  • D. Log into the master node and make changes from there

Answer: C

You can access the master node of the cluster by clicking the SSH button next to it in the Cloud Console.
You can easily use the --properties option of the dataproc command in the Google Cloud SDK to modify many common configuration files when creating a cluster.
When creating a Cloud Dataproc cluster, you can specify initialization actions in executables and/or scripts that Cloud Dataproc will run on all nodes in your Cloud Dataproc cluster immediately after the cluster is set up.


You are operating a streaming Cloud Dataflow pipeline. Your engineers have a new version of the pipeline with a different windowing algorithm and triggering strategy. You want to update the running pipeline with the new version. You want to ensure that no data is lost during the update. What should you do?

  • A. Update the Cloud Dataflow pipeline inflight by passing the --update option with the --jobName set to the existing job name.
  • B. Update the Cloud Dataflow pipeline inflight by passing the --update option with the --jobName set to a new unique job name.
  • C. Stop the Cloud Dataflow pipeline with the Cancel option. Create a new Cloud Dataflow job with the updated code.
  • D. Stop the Cloud Dataflow pipeline with the Drain option. Create a new Cloud Dataflow job with the updated code.

Answer: D

Changing the windowing or triggering strategy makes the new pipeline incompatible with an in-flight --update. Drain finishes processing all buffered data before stopping, so no records are lost; Cancel discards in-flight data.


Your company is migrating their 30-node Apache Hadoop cluster to the cloud. They want to re-use Hadoop jobs they have already created and minimize the management of the cluster as much as possible. They also want to be able to persist data beyond the life of the cluster. What should you do?

  • A. Create a Google Cloud Dataflow job to process the data.
  • B. Create a Google Cloud Dataproc cluster that uses persistent disks for HDFS.
  • C. Create a Hadoop cluster on Google Compute Engine that uses persistent disks.
  • D. Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector.
  • E. Create a Hadoop cluster on Google Compute Engine that uses Local SSD disks.

Answer: D

Cloud Dataproc runs existing Hadoop jobs with minimal cluster management, and the Cloud Storage connector keeps the data in Cloud Storage, where it persists after the cluster is deleted.


You work for a global shipping company. You want to train a model on 40 TB of data to predict which ships in each geographic region are likely to cause delivery delays on any given day. The model will be based on multiple attributes collected from multiple sources. Telemetry data, including location in GeoJSON format, will be pulled from each ship and loaded every hour. You want to have a dashboard that shows how many and which ships are likely to cause delays within a region. You want to use a storage solution that has native functionality for prediction and geospatial processing. Which storage solution should you use?

  • A. BigQuery
  • B. Cloud Bigtable
  • C. Cloud Datastore
  • D. Cloud SQL for PostgreSQL

Answer: A


Which SQL keyword can be used to reduce the number of columns processed by BigQuery?

  • A. BETWEEN
  • B. WHERE
  • C. SELECT
  • D. LIMIT

Answer: C

SELECT allows you to query specific columns rather than the whole table.
LIMIT, BETWEEN, and WHERE clauses will not reduce the number of columns processed by BigQuery.


You work for a mid-sized enterprise that needs to move its operational system transaction data from an on-premises database to GCP. The database is about 20 TB in size. Which database should you choose?

  • A. Cloud SQL
  • B. Cloud Bigtable
  • C. Cloud Spanner
  • D. Cloud Datastore

Answer: C

The data is relational and transactional, and at 20 TB it exceeds what a single Cloud SQL instance can comfortably hold; Cloud Spanner provides horizontally scalable, strongly consistent transactions. Bigtable and Datastore are not relational databases.


Your company is selecting a system to centralize data ingestion and delivery. You are considering messaging and data integration systems to address the requirements. The key requirements are:
- The ability to seek to a particular offset in a topic, possibly back to the start of all data ever captured
- Support for publish/subscribe semantics on hundreds of topics
- Retain per-key ordering
Which system should you choose?

  • A. Apache Kafka
  • B. Cloud Storage
  • C. Cloud Pub/Sub
  • D. Firebase Cloud Messaging

Answer: A

Apache Kafka retains messages for as long as configured and lets consumers seek to any offset, including the beginning of the log, while supporting pub/sub across hundreds of topics and per-key ordering within a partition. Cloud Pub/Sub does not offer seeking back to the start of all data ever captured.


What is the general recommendation when designing your row keys for a Cloud Bigtable schema?

  • A. Include multiple time series values within the row key
  • B. Keep the row key as an 8-bit integer
  • C. Keep your row key reasonably short
  • D. Keep your row key as long as the field permits

Answer: C

A general guide is to keep your row keys reasonably short. Long row keys take up additional memory and storage and increase the time it takes to get responses from the Cloud Bigtable server.
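Because Bigtable sorts rows lexicographically by key, key design also controls read order. A sketch of one common scheme for per-device time series, where the timestamp is reversed so the newest readings sort first (the device name, separator, and widths below are hypothetical choices, not a prescribed format):

```python
def make_row_key(device_id: str, ts_millis: int) -> str:
    """Build a compact, Bigtable-style row key for a timestamped reading.

    Subtracting the timestamp from a fixed maximum reverses the sort,
    so lexicographic order puts the most recent rows first. Zero-padding
    to a fixed width keeps string order consistent with numeric order.
    """
    max_millis = 10**13                       # upper bound on epoch millis
    reversed_ts = max_millis - ts_millis
    return f"{device_id}#{reversed_ts:013d}"

k_new = make_row_key("sensor42", 1_700_000_000_000)
k_old = make_row_key("sensor42", 1_600_000_000_000)
assert k_new < k_old                          # newer rows sort first
print(k_new)                                  # sensor42#8300000000000
```

Keys like this stay short while still grouping a device's rows together, in line with the guidance above.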


Thanks for reading the newest Professional-Data-Engineer exam dumps! We recommend trying the PREMIUM Professional-Data-Engineer dumps in VCE and PDF here: (239 Q&As Dumps)