In Part 1 of our blog series we previewed TensorFlow models that combined traditional natural language processing (NLP) techniques, such as entity and topic extraction, with deep learning techniques, such as embeddings, to predict with 92% accuracy which Reddit community (subreddit) a news article would land in for discussion, and estimated the popularity score and volume of comments such a post would generate. Part 2 of the series went into the details of developing and tuning these models. In this blog we will review the feature preprocessing infrastructure, based on Cloud Dataflow and BigQuery, that we built for these models.

Our first task was to place the GDELT and Reddit data into a database that was easy to query. There are GDELT and Reddit BigQuery datasets already, but we wanted to do a deeper sentiment analysis on the raw content of news articles and Reddit posts and comments, and for that we used the Dataflow Opinion Analysis project (GitHub repo) we wrote about earlier.

Back in June 2017, Reza Rokni and John LaBarge, two GCP solutions architects, shared their best practices for developing production-quality Dataflow pipelines in a two-part blog series here (1) and here (2). As we were solidifying the design for our data preparation pipeline, we used several of these best practices to future-proof our opinion analysis infrastructure. The rest of the article will explain in more detail how we applied these Dataflow design practices to the data problems we encountered in building our training dataset.

The top 10 design patterns for Dataflow can be roughly divided according to the lifecycle of a Dataflow pipeline: onboarding external data, joining datasets, analyzing data, writing outputs and handling invalid records, and orchestrating the execution of a pipeline.

We used patterns in almost every one of the above categories. For onboarding external data, we used the external service access pattern, which allowed us to call the Cloud Natural Language API to enrich our dataset with additional subject information. For joining the Reddit posts and comments, we used the CoGroupByKey pattern. When performing data analysis, we relied on the GroupByKey pattern to implement composite keys with multiple properties. And, lastly, when dealing with invalid or malformed input data, we implemented a Cloud Bigtable sink that collected invalid input records according to the dead letter queue pattern.

Let's dive deeper into how these patterns helped us assemble our training dataset.

As we mentioned before, we sourced half of our training set (the features) from GDELT using files in gzipped JSON format. The other half of our training set (the labels) are Reddit posts and their comments, available in a BigQuery dataset in two tables: posts and comments.

We wanted to run both the GDELT news articles and the Reddit posts/comments through a very similar processing pipeline, extracting subjects and sentiments from both of them, and storing the results in BigQuery. We therefore wrote a Dataflow pipeline that can change its input source depending on a parameter we pass to the job, but does all the downstream processing in the same way. Ultimately, we ended up with two BigQuery datasets that had the same schema, and that we could join with each other via the URL of the news article post using BigQuery's SQL.

Here is the high-level design of the Dataflow pipeline that we used to bring both the GDELT and Reddit data over to BigQuery:

- Read the inputs: news articles from GDELT or posts and comments from Reddit
- Eliminate duplicate records
- Extract the opinions (subjects and sentiments) found in the text
- Write output records (opinions found in the text) to BigQuery and invalid inputs to Bigtable

As we were implementing the input reader for Reddit data, we had to join the posts and comments tables together. We used the CoGroupByKey design pattern to implement this join. For each dataset in the join we created a key-value (KV) pair and then applied the CoGroupByKey operation to match the keys between the two datasets. After the join, we iterated over the results to create a single entity that had both post and comment information.
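The actual reader implementation lives in the Dataflow Opinion Analysis repo; the snippet below is only a minimal Apache Beam (Java) sketch of the CoGroupByKey pattern just described. The class name RedditJoinSketch, the post IDs, and the plain-string payloads are placeholders standing in for the real posts and comments schemas.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

public class RedditJoinSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Each dataset becomes KV pairs keyed by the post ID (the join key).
    // In the real pipeline these come from the Reddit posts and comments tables.
    PCollection<KV<String, String>> posts = p.apply("Posts", Create.of(
            KV.of("t3_abc", "Post title and body"),
            KV.of("t3_def", "Another post"))
        .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));
    PCollection<KV<String, String>> comments = p.apply("Comments", Create.of(
            KV.of("t3_abc", "first comment"),
            KV.of("t3_abc", "second comment"))
        .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));

    // Tags identify each side of the join inside the CoGbkResult.
    final TupleTag<String> postTag = new TupleTag<>();
    final TupleTag<String> commentTag = new TupleTag<>();

    // CoGroupByKey matches the keys between the two datasets.
    PCollection<KV<String, CoGbkResult>> joined =
        KeyedPCollectionTuple.of(postTag, posts)
            .and(commentTag, comments)
            .apply(CoGroupByKey.create());

    // After the join, iterate over the results to build a single entity
    // that carries both post and comment information.
    joined.apply("CombinePostAndComments",
        ParDo.of(new DoFn<KV<String, CoGbkResult>, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            CoGbkResult result = c.element().getValue();
            String post = result.getOnly(postTag, "");   // at most one post per key
            Iterable<String> postComments = result.getAll(commentTag);
            c.output(post + " | comments: " + String.join("; ", postComments));
          }
        }));

    p.run().waitUntilFinish();
  }
}
```

Note that a post with no comments still appears in the output, because getAll simply returns an empty iterable for its comment tag.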
A couple of tips for this join pattern:

- Side Inputs could be a faster alternative to CoGroupByKey when joining datasets, if one of the datasets can fit into the memory of a Dataflow worker VM.
- Use the Cloud Dataflow Shuffle for better performance of your CoGroupByKey by adding the corresponding parameter to your pipeline.

After reading our inputs (either news articles from GDELT or posts from Reddit), we proceeded to eliminate duplicates from our data. Duplicate elimination is typically done either by looking up a record ID among a list of already processed IDs, or, if the record wasn't yet processed, by grouping unprocessed data elements by some key that represents uniqueness and picking one of these elements to represent the entire group. There are three levels of filtering in our pipeline that use both of these techniques.
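The filtering code in the project is more involved than this, but as a rough illustration, here is a minimal Beam (Java) sketch of the second technique: grouping records by a key that represents uniqueness and keeping one element per group. The DedupByKeySketch class, the "key|payload" record format, and the example URLs are all hypothetical.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class DedupByKeySketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Hypothetical records in "uniqueness key|payload" form; two share the same key.
    PCollection<String> records = p.apply(Create.of(
        "example.com/article-a|first copy of the article",
        "example.com/article-a|second copy of the article",
        "example.com/article-b|only copy of this article"));

    PCollection<String> deduped = records
        // Key every record by the part before '|' (the key that represents uniqueness).
        .apply(MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
            .via((String element) -> KV.of(element.split("\\|", 2)[0], element)))
        // Group all records that share the same uniqueness key.
        .apply(GroupByKey.<String, String>create())
        // Pick one element to represent the entire group.
        .apply(ParDo.of(new DoFn<KV<String, Iterable<String>>, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            c.output(c.element().getValue().iterator().next());
          }
        }));

    p.run().waitUntilFinish();
  }
}
```

In a real pipeline you would likely pick the representative element deterministically, for example the record with the earliest timestamp, rather than whichever element the runner yields first.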