In this article, I will summarize some thoughts and lessons learned from my internship at Google this summer. Unlike most posts of this kind, this one will not talk about perks or office life (it is 2020, after all!). So, let's focus on the work itself: the experience and the lessons.
Learn to unblock yourself
The most important soft skill is knowing how to unblock yourself. Actively unblocking yourself matters especially when your project's tech stack is cutting-edge and/or involves a lot of uncertainty. While searching and pondering can solve most problems, sometimes you will need others' help, and this is especially tricky when those others are not your host(s). In summary, to get help effectively, try to improve the visibility of your issue and lower the barrier for others to help. More concretely:
- Always try to submit a bug in the tracker with complete instructions and timely updates on your findings.
- Ping someone who owns the service/code in chat; introduce yourself and your problem briefly and politely. If the other side doesn't respond (which I actually never encountered), you can also shoot an email in case they are more of an email person.
- If needed, schedule a short meeting; it ensures you get the other person's full attention, at least for those few minutes. Most people will accept and help you in real time.
- Ask questions well: link to the original thing. Don't assume people have the same context, and avoid vague uses of "it".
- Note that while a big company (especially one like Google) can feel like an entire software industry, with people working on totally different things, it is not like the outside world, where you can hardly get direct help. Inside Google, you can just reach out to any Googler and get help.
Coding (Software Engineering)
My main coding task was to design, develop, and integrate a new capability into an existing framework.
Why was it challenging? First, it was a cross-PA (Product Area) collaboration, so I needed support and trust from both sides. To build trust with stakeholders, I held regular meetings for design decisions (with draft proposals circulated beforehand).
Also, I needed to understand and build on top of a highly complex but young codebase. Luckily, since I had some prior experience with similar codebases (the Servo browser engine, the Iris framework, the Hadrian build system, to name a few), I didn't find it too hard to get started. Several points are worth noting when you need to do something similar:
- Ask early: how does your team debug and test its code?
- Actively build a mind map from your code reading, then present it to the code owners, with the problems you want to solve in mind. Reach consensus and clear up most confusions. Sometimes I also made this mind map concrete, e.g. as a diagram.
- Get your hands dirty:
  - Utilize a code search tool to help you answer questions such as:
    - How is something used?
    - What is its definition?
    - How is it tested?
  - Try hooking into a debugger to see the key code path in action.
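As a small illustration of the debugger point, here is what hooking into a Python code path could look like (a minimal sketch; `load_config` and `process_batch` are hypothetical stand-ins for the real functions on the path you want to understand):

```python
import os

# Minimal sketch of hooking into a debugger to follow a key code path.
# load_config and process_batch are hypothetical stand-ins for whatever
# functions actually sit on the path.

def load_config():
    return {"batch_size": 2}

def process_batch(config, batch):
    return [x * config["batch_size"] for x in batch]

def main():
    config = load_config()
    if os.environ.get("DEBUG"):
        # Drop into pdb just before the interesting call: step over with
        # `n`, step into with `s`, inspect variables with `p config`.
        breakpoint()
    print(process_batch(config, [1, 2, 3]))

if __name__ == "__main__":
    main()
```

Setting a breakpoint right before the call you care about, then stepping into it, tends to surface the real control flow much faster than reading alone.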
Also, reading a codebase is not the goal; the goal is to solve your problem using your knowledge of the codebase. In this internship, my problem was quite generic, so it required architecture-level thinking to decide where and how to change the code. A couple of things to consider:
- How will the new code interact with future readers and writers of the code?
- How will the new API interact with the users?
- How will the API be extended & composed with future features?
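To make the extensibility and composability questions concrete, here is one common shape such an API can take (a hypothetical sketch, not the actual design from my internship): each capability is a standalone transform, and users compose them without the framework needing to know about future features.

```python
from typing import Callable

# A capability is just a function from input to output; composition is
# the extension point, so new features slot in without touching old code.
Transform = Callable[[str], str]

def compose(*transforms: Transform) -> Transform:
    """Chain transforms left to right into a single transform."""
    def composed(text: str) -> str:
        for t in transforms:
            text = t(text)
        return text
    return composed

# An "existing" capability.
def lowercase(text: str) -> str:
    return text.lower()

# A "new" capability composes in without modifying existing code.
def strip_spaces(text: str) -> str:
    return text.replace(" ", "")

pipeline = compose(lowercase, strip_spaces)
print(pipeline("Hello World"))  # prints "helloworld"
```

The design choice here is that extension happens by adding new transforms rather than by editing the framework, which directly answers "how will the API be extended and composed with future features?"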
After understanding the codebase, I started to propose designs. I found that a few visual diagrams are invaluable for helping stakeholders understand a proposal. In some cases, a very small code patch can demonstrate your idea as well.
With a design in mind, I applied an iterative workflow to implement the functionality. First, I had a prototype working and end-to-end tested within the first month. Then I repeatedly ran the tests throughout the iterations, until the design details were finalized and the code was actually merged into the codebase (which took at least another two months until all reviews and approvals were in place). Note that some parts of the implementation might start as a very partial and incorrect copy & paste of existing code; it is far from the end result, but it gets you started quicker than writing the whole thing from thin air.
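A sketch of what "prototype end-to-end tested first" can look like: one coarse test that exercises the whole path and stays green while the internals are iterated on (the pipeline here is an illustrative stand-in, not the real one):

```python
# One coarse end-to-end test over the whole pipeline: internals can be
# refactored freely across iterations as long as this stays green.

def run_pipeline(records):
    # Illustrative stand-in for the real prototype: parse, transform,
    # then aggregate.
    parsed = [int(r) for r in records]
    return sum(x * 2 for x in parsed)

def test_pipeline_end_to_end():
    assert run_pipeline(["1", "2", "3"]) == 12

test_pipeline_end_to_end()
print("end-to-end test passed")
```

Because the test only pins down the external behavior, the copy-pasted, half-correct internals can be replaced piece by piece without rewriting the test.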
Model training & Experiments
I trained a lot of models (some are only for validation, some are used in the final results).
So, why is training models challenging? Isn't it just a one-click thing? While the process is certainly much better than it used to be, and continues to improve, it still requires some work:
- The training is conducted on a Google-scale dataset using Google-only infra (e.g. TPU). The model can be quite large as well (e.g. there are 200M parameters in a BERT-based model).
- Thus, I needed to understand the cost and resource limitations and address problems accordingly.
- e.g. I had to do some ad-hoc but aggressive capacity planning due to the huge model sizes. I also needed to anticipate my disk usage to avoid disk-exhaustion failures.
- Due to the numerical and iterative nature of the process, it is hard to debug if your model doesn’t “look” right.
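For the capacity-planning point, a back-of-envelope estimate goes a long way. Here is a rough sketch; the 200M figure comes from above, but the bytes-per-parameter, optimizer multiplier, and checkpoint count are assumptions that depend on the precision, optimizer, and retention policy:

```python
# Back-of-envelope disk estimate for model checkpoints.
params = 200_000_000        # ~200M parameters (BERT-scale, from above)
bytes_per_param = 4         # float32 weights (assumption)
optimizer_multiplier = 3    # e.g. Adam keeps ~2 extra copies per weight (assumption)
checkpoints_kept = 10       # how many checkpoints are retained (assumption)

per_checkpoint_gb = params * bytes_per_param * optimizer_multiplier / 1e9
total_gb = per_checkpoint_gb * checkpoints_kept
print(f"~{per_checkpoint_gb:.1f} GB per checkpoint, ~{total_gb:.0f} GB total")
# prints "~2.4 GB per checkpoint, ~24 GB total"
```

Even a crude estimate like this tells you whether your quota is off by an order of magnitude before the first disk-exhaustion failure does.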
Training models is actually just one part of the entire experiment plan. The plan is basically about how to design and conduct experiments, and how to do the analysis that demonstrates the point of the experiment scientifically (the most important question being whether the new capability I implemented is useful).
The experiment design involves the considerations below:
- Understand the big picture: Write job configurations → Prepare datasets → Train and monitor models → Perform batch inference → Inference result distillation → Python data analysis → Spreadsheet reviews with stakeholders → Final presentation of key results
- Design metrics based on business needs
- What is the key metric to record and compare?
- What are all the metrics that we might be interested in?
- Note that metrics can be both performance related (accuracy, AUC PR etc.) and cost related (training time, size of model etc.)
- Design models & datasets to compare
- Note that the execution of this part (preparing datasets and training models) can have a long round-trip time (RTT), normally more than one day. Plan the execution early.
- Carefully check the numerical end-to-end results.
- For example, for AUC PR metric, is it the “one-hot” type or not?
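The only way to be sure what a metric number means is to pin down its exact definition, and a tiny reference implementation you trust can serve as a cross-check. Here is the standard step-wise average precision (AUC-PR) formula, AP = Σ (R_k − R_{k−1}) · P_k, in plain Python; this is a common definition, not necessarily the exact one your infra uses:

```python
def average_precision(y_true, y_score):
    """Step-wise AP: sum over ranks of (delta recall) * precision."""
    order = sorted(range(len(y_score)), key=lambda i: -y_score[i])
    total_pos = sum(y_true)
    tp, ap, prev_recall = 0, 0.0, 0.0
    for rank, i in enumerate(order, start=1):
        tp += y_true[i]
        precision = tp / rank
        recall = tp / total_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

# Toy example: 3 positives among 4 examples, ranked by score.
print(round(average_precision([1, 0, 1, 1], [0.9, 0.8, 0.7, 0.3]), 4))
# prints 0.8056
```

Running a reference implementation like this against your infra's reported number quickly exposes definition mismatches such as label-encoding (e.g. one-hot vs. integer labels) differences.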
During experiment execution, the most challenging part is end-to-end validation of numerical results. Basically, with some preliminary results in hand, you need to ask yourself:
- Do these numbers make sense?
- Are the metrics/… from different sources consistent with each other?
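The consistency question can often be mechanized. A minimal sketch, with illustrative numbers: the same metric computed by two different paths (say, the training job's built-in eval vs. an offline analysis script) should agree within a small tolerance.

```python
import math

# Sanity check: the same metric from two independent sources should
# agree within tolerance. The numbers here are illustrative.
metric_from_trainer = 0.8312
metric_from_analysis = 0.8309

assert math.isclose(metric_from_trainer, metric_from_analysis, abs_tol=1e-3), \
    "metrics disagree: check data splits, thresholds, and metric definitions"
print("consistency check passed")
```

When such an assertion fires, the usual suspects are mismatched data splits, different decision thresholds, or subtly different metric definitions between the two paths.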
If there is any bug, some debug tips:
- Scope the related code and read through it line by line, carefully. Don't make assumptions. Check things carefully, especially anything hardcoded (NOTE: hardcoding is a trade-off; there is always something hardcoded in order to balance progress against reproducibility).
- If your prototype was not designed to be fully observable, make it observable as far as your debugging needs require, e.g.
- Add logging
- Make it possible to divide, test and observe different parts in a pipeline
- Cut down the input size to speed up the iteration.
- Ask yourself: am I spending time meaningfully in troubleshooting?
- Also keep the people most familiar with the related codebase in the loop; they might provide quick insight into what might be wrong.
- Think about the backup plan: not all bugs can or should be fixed when there is a deadline to ship.
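The observability tips above can be sketched together: each pipeline stage logs its input size and can be run in isolation, so a bad intermediate result is easy to localize. The stage functions here are hypothetical stand-ins:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("pipeline")

# Retrofitting observability onto a prototype: log at stage boundaries,
# and keep each stage independently callable so it can be tested alone.

def parse(records):
    log.info("parse: %d records in", len(records))
    return [int(r) for r in records]

def transform(values):
    log.info("transform: %d values in", len(values))
    return [v * 2 for v in values]

def run(records):
    return transform(parse(records))

print(run(["1", "2", "3"]))  # prints [2, 4, 6]
```

With this structure, cutting down the input size is just passing a smaller list, and dividing the pipeline is just calling `parse` or `transform` directly.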
Remote working
Remote working is hard. There are many articles on this topic, but from my experience, one point is especially important: excessive communication. Some concrete tips:
- Use comments in experimental code to communicate
- Note that some comments might only be useful during the review stage. Since, with WFH, code review happens in an entirely async fashion, it is important to write more comments to explain your thinking.
- Design doc
- Use more digital illustrations, since you can no longer work with your colleagues at a whiteboard.
- Again, be more verbose, to reduce the number of back-and-forth iterations in an async setting, which are expensive.
- Explicit action items: after each meeting, or even just during a chat, make sure it is clear who should do what. When anything requires more than a few minutes, submit a ticket to track it.
Finally, some miscellaneous points. Each intern can expense three books, and one of the books I expensed was Software Engineering at Google. I think it is a great book for anyone interested in software engineering.