This article introduces Fedlearner, a federated learning platform developed by ByteDance’s federated learning team, shares the platform’s technical implementation and application experience, and shows how ByteDance found a breakthrough in putting federated learning into production.
As a distributed machine learning paradigm, federated learning can effectively solve the problem of data silos, allowing participants to jointly model and mine data value without sharing data.
In the past two years, federated learning technology has developed rapidly, and major Internet and financial technology companies such as Alibaba, Tencent, Baidu, JD.com, Ant Financial and WeBank have all entered the field. Not long ago, ByteDance’s federated learning team also open-sourced Fedlearner, its self-developed federated learning platform.
According to reports, Fedlearner has been deployed in multiple production scenarios in e-commerce, finance, education and other industries. Wu Di, head of federated learning technology at ByteDance, said in an exclusive interview with InfoQ that the main difficulty for federated learning is how to deliver the greatest perceived business value to customers, because partners in different industries have different product features and value propositions.
Thanks to ByteDance’s long-term accumulation of machine learning modeling technology in recommendation and advertising, its federated learning team has found a direction for helping corporate customers achieve perceived business value: building on the strengths of ByteDance’s personalized recommendation algorithms and models to explore and identify production scenarios. For example, in an e-commerce advertising case, Fedlearner helped partners improve delivery efficiency by more than 10%, increase traffic consumption by 15%+, and raise e-commerce platform ROI by 20%+.
Beyond e-commerce, joint exploration with leading customers in the education industry has also confirmed the effect: it helped education customers increase advertising volume by 124.73%, increase the number of regular-priced course renewal users by 211.54%, raise the renewal rate by 32.69%, and reduce the customer acquisition cost of regular-priced course renewal users by 11.73%.
Even so, from the perspective of the entire industry, implementing federated learning technology is still difficult. In balancing security against efficiency, and in modeling capabilities and the evolution of machine learning algorithms, there is still a lot that platforms and enterprises need to do.
In this article, we will share the technical implementation and application experience of ByteDance’s federated learning platform Fedlearner, and see how ByteDance finds a breakthrough in the implementation of federated learning.
1. The technical implementation and challenges of the federated learning platform Fedlearner
ByteDance’s federated learning team quietly released its self-developed federated learning platform Fedlearner (project address: https://github.com/bytedance/fedlearner) in early 2020 and has updated it continuously since. Version v1.5 was released on October 26, 2020.
Wu Di told InfoQ: “We open-sourced Fedlearner because ByteDance holds a huge amount of user data and is well aware of how important it is to protect that data. By open-sourcing Fedlearner, we hope to work with industry partners to advance privacy computing and to protect user data security together with our customers. At the same time, while protecting user data security, open source gives our platform an open and transparent mechanism that strengthens customer trust.”
The Fedlearner platform supports multiple federated learning modes. The system consists of modules such as the console, trainer, data processing, and data storage. Each module is deployed symmetrically on the clusters of both parties participating in the federation, and the two sides communicate through agents to carry out training.
Before Fedlearner starts training, the two parties must compute the intersection of their data, and the model is then trained on that intersection. There are usually two ways to intersect the training data: streaming data intersection and PSI data intersection.
Streaming data intersection
Streaming data usually refers to data generated by online traffic that both parties share. For example, in an advertising scenario, when a user clicks an advertisement, a data record is generated on both the media platform and the advertiser side. To jointly train a model with Fedlearner, the two records must first be aligned into a single sample.
However, with streaming data, the time at which records are persisted and the reliability of sample storage differ between the two sides, so samples can be missing or appear in a different order on each side. A protocol is therefore needed that intersects the two parties’ samples by example_id, puts them in a common order, and produces the sample entries shared by both parties for model training.
Most of Fedlearner’s current application scenarios involve large-scale data. For this reason, in streaming data processing, Fedlearner first divides the data into N partitions by hashing the example_id. To compute the intersection, each side starts N workers, which are paired into N pairs, and each pair handles one partition. Within each pair, the worker acting as leader sends the example_ids in its own data stream to the follower in order. The follower intersects them with its own local data stream and sends the intersection back to the leader.
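The partition-and-pair scheme can be illustrated with a minimal single-process sketch. This is not Fedlearner’s code: the function names, the MD5-based partitioning, and the in-memory lists standing in for data streams and workers are all assumptions made for illustration.

```python
import hashlib

def partition_of(example_id: str, num_partitions: int) -> int:
    """Stable hash partitioning (Python's built-in hash() is salted per process)."""
    digest = hashlib.md5(example_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def split_into_partitions(example_ids, num_partitions):
    """Both sides apply the same hash, so matching IDs land in the same partition."""
    parts = [[] for _ in range(num_partitions)]
    for eid in example_ids:
        parts[partition_of(eid, num_partitions)].append(eid)
    return parts

def follower_intersect(leader_ids, follower_ids):
    """The follower keeps the leader's order and returns only the IDs it also holds."""
    local = set(follower_ids)
    return [eid for eid in leader_ids if eid in local]

def join(leader_stream, follower_stream, num_partitions=4):
    """One (leader, follower) worker pair per partition; here they run sequentially."""
    leader_parts = split_into_partitions(leader_stream, num_partitions)
    follower_parts = split_into_partitions(follower_stream, num_partitions)
    return [follower_intersect(l, f) for l, f in zip(leader_parts, follower_parts)]

# Hypothetical example: the follower is missing some samples and stores them out of order.
leader = [f"ex{i}" for i in range(10)]
follower = [f"ex{i}" for i in (9, 7, 5, 3, 1, 11)]
joined = join(leader, follower)
```

Each entry of `joined` is the aligned sample list for one partition, in the leader’s order, which is the shared order both sides then read from.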
In practice, the two parties in a federated training job usually have different data processing pipelines, and they store and read samples in different orders. Intersection is therefore usually implemented with a key-value lookup, which queries the full data set by random access. Random access over the full data set, however, is very expensive.
To solve this problem, Fedlearner uses a time window: it keeps samples with close timestamps from both sides in memory and discards the small number of samples that fall outside the window, which greatly reduces hardware and operations costs.
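A time-window join of this kind might look like the following sketch. It assumes each side’s stream is a list of `(timestamp, example_id)` pairs sorted by timestamp; the function name and the eviction policy are illustrative assumptions, not Fedlearner’s implementation.

```python
from collections import OrderedDict

def windowed_join(leader, follower, window):
    """Match leader samples against follower samples whose timestamps are within `window`."""
    buffer = OrderedDict()   # follower example_id -> timestamp, oldest first
    matched, nxt = [], 0
    for ts, eid in leader:
        # Load follower samples up to `window` ahead of the current leader sample.
        while nxt < len(follower) and follower[nxt][0] <= ts + window:
            f_ts, f_eid = follower[nxt]
            buffer[f_eid] = f_ts
            nxt += 1
        # Evict follower samples that are now too old to match anything.
        while buffer and next(iter(buffer.values())) < ts - window:
            buffer.popitem(last=False)
        if eid in buffer:
            matched.append(eid)
            del buffer[eid]
    return matched

# Hypothetical streams: "d" is recorded 15 time units apart on the two sides.
leader = [(1, "a"), (2, "b"), (3, "c"), (20, "d")]
follower = [(1, "b"), (2, "a"), (4, "c"), (5, "d")]
result = windowed_join(leader, follower, window=5)
```

Here `"d"` is dropped because its two records are further apart than the window. Memory use is bounded by the window size rather than by the full data set, and the price is losing the few samples that arrive too far apart.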
PSI data intersection
Unlike streaming data, data in some scenarios is not generated by shared online traffic but is recorded independently by each party, such as user profile data recorded by different financial institutions.
For such independently recorded data, the two parties must, before model training, use information they both hold (such as user IDs) to find the users they have in common. For example, suppose institution A has data on 200 million users and institution B has data on 400 million users, and 150 million users appear in both. To find them, the two institutions need to identify the user IDs that exist on both sides.
However, while computing the intersection, institution A does not want to disclose to institution B the 50 million of its users that are not in the intersection. Likewise, institution B does not want to disclose to institution A its 250 million users that are not in the intersection.
To this end, the Fedlearner team adopted the method of PSI (Private Set Intersection) encrypted data intersection.
PSI lets two parties, each holding its own set, jointly compute the intersection of the two sets. Throughout the process, both in transmission and in computation, the real user IDs of both parties are encrypted and hidden. When the interaction ends, one or both parties obtain the correct intersection and learn nothing about the other party’s set beyond the intersection.
Because it has to handle data that each side stores separately offline, PSI data intersection is split into two stages. The first stage securely encrypts the full ID sets of both parties offline. The basic process is as follows:
ByteDance encrypts and hashes its full set of raw ByteDance IDs with its private key, and stores the mapping from each raw ByteDance ID to its hashed encrypted ByteDance ID in a database;
The customer blinds its full or incremental set of raw customer IDs and transmits the blinded IDs to ByteDance;
ByteDance encrypts the received blinded data with its private key and returns the encrypted result to the customer;
The customer unblinds and hashes the returned data, and stores the mapping from each hashed encrypted customer ID back to its raw customer ID in a database.
After the offline stage is complete, the two parties can intersect in real time in the online stage:
ByteDance looks up the hashed encrypted ByteDance ID for each raw ByteDance ID in real-time traffic and transmits it to the customer;
The customer intersects the received hashed encrypted ByteDance ID with the hashed encrypted customer IDs in its database; any match can be mapped back to the raw common ID;
The customer decides, according to its business needs, whether to send anything back and what information to send.
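The offline and online steps above match the shape of a blind-signature PSI. The sketch below assumes an RSA blind signature (the source only says “private key”, “blind” and “deblind”), uses a toy key built from two small Mersenne primes that is far too weak for real use, and invents all function names for illustration.

```python
import hashlib
import math
import secrets

# Toy RSA key pair held by the key owner (ByteDance in the text). NOT secure.
P, Q = 2**61 - 1, 2**89 - 1
N, E = P * Q, 65537
D = pow(E, -1, (P - 1) * (Q - 1))      # private exponent

def h1(raw_id: str) -> int:
    """Hash an ID into the RSA group."""
    return int.from_bytes(hashlib.sha256(raw_id.encode()).digest(), "big") % N

def h2(value: int) -> str:
    """Final hash that turns a signature into a comparable token."""
    return hashlib.sha256(str(value).encode()).hexdigest()

# Offline step 1 (key owner): raw ID -> hashed encrypted ID.
def sign_own_ids(raw_ids):
    return {raw: h2(pow(h1(raw), D, N)) for raw in raw_ids}

# Offline step 2 (customer): blind each ID with a random factor r.
def blind(raw_ids):
    blinded, factors = [], []
    for raw in raw_ids:
        r = secrets.randbelow(N - 2) + 2
        while math.gcd(r, N) != 1:
            r = secrets.randbelow(N - 2) + 2
        factors.append(r)
        blinded.append(h1(raw) * pow(r, E, N) % N)
    return blinded, factors

# Offline step 3 (key owner): sign the blinded values without seeing the IDs.
def sign_blinded(blinded):
    return [pow(b, D, N) for b in blinded]

# Offline step 4 (customer): unblind, hash, and store token -> raw ID.
def unblind(signed, factors, raw_ids):
    return {h2(s * pow(r, -1, N) % N): raw
            for s, r, raw in zip(signed, factors, raw_ids)}

owner_ids = ["u1", "u2", "u3"]
customer_ids = ["u2", "u3", "u4"]
owner_tokens = sign_own_ids(owner_ids)
blinded, factors = blind(customer_ids)
customer_table = unblind(sign_blinded(blinded), factors, customer_ids)

# Online stage: the owner sends tokens from live traffic; the customer looks them up.
common = sorted(customer_table[t] for t in owner_tokens.values() if t in customer_table)
```

The key owner only ever sees blinded values, and the customer only ever sees tokens it cannot invert, so neither side learns the other’s non-intersecting IDs from this exchange.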
Federated learning techniques are essentially designed to help both sides of the federation train models better. Different companies will adopt different model training methods due to different data types and characteristics accumulated before, and different application scenarios.
Common kinds of model training include neural network models, tree models, linear regression models, and so on. Neural network models are most commonly applied to recommendation, including personalized content recommendation and advertisement recommendation, while tree models are more often used in finance to model indicators such as credit and risk.
Because ByteDance has accumulated a lot of personalized recommendation technology, Fedlearner’s model training is also mainly based on neural network model training and tree model training.
Neural network model training
Depending on how features are distributed, neural network training in federated learning can be divided into two modes: vertical (Cross-silo) and horizontal (Cross-device). In vertical mode, the participants hold different feature dimensions of the same samples, and the model is split into two parts, similar to model-parallel training. In horizontal mode, the participants hold the same feature dimensions of different samples, and each participant has a copy of the model, similar to data-parallel training.
Both vertical and horizontal training can be reduced to one framework: a pair of workers each runs a neural network and exchanges intermediate results and gradients with the other. To support this, Fedlearner implements a gRPC-based communication protocol and integrates it into TensorFlow in the form of operators.
With this communication protocol, algorithm engineers can turn ordinary TensorFlow model code into a model that supports federated training simply by adding a sending operator (send_op) and a receiving operator (receive_op).
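The difference between the two modes comes down to how one logical table is cut. The toy example below uses made-up column names and user IDs to show a vertical split (same samples, different feature columns) and a horizontal split (same feature columns, different samples).

```python
# A hypothetical joint table: user ID -> features.
table = {
    "u1": {"age": 31, "city": "A", "clicks": 4, "purchases": 1},
    "u2": {"age": 25, "city": "B", "clicks": 9, "purchases": 0},
    "u3": {"age": 42, "city": "A", "clicks": 2, "purchases": 3},
    "u4": {"age": 37, "city": "C", "clicks": 7, "purchases": 2},
}

def vertical_split(table, columns_a, columns_b):
    """Both parties hold every sample, but each holds different feature columns."""
    part_a = {uid: {c: row[c] for c in columns_a} for uid, row in table.items()}
    part_b = {uid: {c: row[c] for c in columns_b} for uid, row in table.items()}
    return part_a, part_b

def horizontal_split(table, ids_a, ids_b):
    """Both parties hold every feature column, but each holds different samples."""
    return {u: table[u] for u in ids_a}, {u: table[u] for u in ids_b}

vert_a, vert_b = vertical_split(table, ["age", "city"], ["clicks", "purchases"])
horiz_a, horiz_b = horizontal_split(table, ["u1", "u2"], ["u3", "u4"])
```

In the vertical case the parties must first align on the shared user IDs, which is why the data intersection step comes before training.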
Based on neural network model training, Fedlearner currently handles up to 40TB of data. In an unstable public network communication environment, it can quickly and stably complete training data intersection and model training with very few resources, with high iteration efficiency.
Tree model training
Unlike neural network training, tree model training currently supports only a two-party vertical mode, in which one party provides features and the other provides labels and features. To train the model while protecting the privacy of both the data and the labels, and still meet practical performance requirements, the Fedlearner tree model trainer adopts the SecureBoost algorithm:
The party holding the labels computes the gradient of each sample, encrypts the gradients with the Paillier encryption algorithm, and sends them to the other party;
The other party uses the semi-homomorphic (additively homomorphic) property of Paillier encryption to compute, in the ciphertext domain, gradient statistics for buckets of feature values, and returns them;
Finally, the party holding the labels decrypts the statistics and finds the optimal split point.
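The three steps can be sketched with a toy Paillier implementation. Everything here is an illustrative assumption rather than Fedlearner’s code: the key uses two small Mersenne primes (not secure), gradients are fixed-point encoded, the loss is squared error (so every hessian is 1 and bucket counts stand in for hessian sums), and the gain formula is the usual regularized one with an arbitrary `reg=1.0`.

```python
import math
import secrets

# Toy Paillier key (NOT secure). Public: N. Private: LAM, MU.
P, Q = 2**61 - 1, 2**89 - 1
N, N2 = P * Q, (P * Q) ** 2
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
MU = pow(LAM, -1, N)
SCALE = 10**6                                   # fixed-point scale for floats

def encrypt(value: float) -> int:
    m = round(value * SCALE) % N                # negatives wrap around mod N
    r = secrets.randbelow(N - 1) + 1
    return (1 + m * N) * pow(r, N, N2) % N2

def decrypt(cipher: int) -> float:
    m = (pow(cipher, LAM, N2) - 1) // N * MU % N
    return (m - N if m > N // 2 else m) / SCALE

def add(c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return c1 * c2 % N2

# Step 1 (label holder): per-sample gradients, encrypted.
labels = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
preds = [0.5] * 6
enc_grads = [encrypt(p - y) for p, y in zip(preds, labels)]

# Step 2 (feature holder): sum encrypted gradients per feature bucket.
feature = [0.1, 0.9, 0.2, 0.3, 0.8, 0.7]
edges = [0.25, 0.5, 0.75]                       # 4 buckets
def encrypted_histogram(enc_grads, feature):
    sums = [encrypt(0.0) for _ in range(len(edges) + 1)]
    counts = [0] * (len(edges) + 1)
    for c, x in zip(enc_grads, feature):
        b = sum(x > e for e in edges)
        sums[b], counts[b] = add(sums[b], c), counts[b] + 1
    return sums, counts

enc_sums, counts = encrypted_histogram(enc_grads, feature)

# Step 3 (label holder): decrypt bucket sums and pick the best split.
grad_sums = [decrypt(c) for c in enc_sums]
def best_split(grad_sums, counts, reg=1.0):
    gains = []
    for k in range(len(grad_sums) - 1):         # split after bucket k
        gl, nl = sum(grad_sums[:k + 1]), sum(counts[:k + 1])
        gr, nr = sum(grad_sums[k + 1:]), sum(counts[k + 1:])
        gains.append(gl**2 / (nl + reg) + gr**2 / (nr + reg))
    return max(range(len(gains)), key=gains.__getitem__)

split_after_bucket = best_split(grad_sums, counts)
```

The feature holder never sees a plaintext gradient, and the label holder only sees per-bucket aggregates rather than individual feature values.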
Judging from model training results, ByteDance’s federated learning platform Fedlearner is already comparable to, or even ahead of, other platforms in the industry.
Previously, ByteDance’s federated learning team compared the tree models of Fedlearner and WeBank’s FATE on a single machine. Training metrics on the MNIST data set show that, with the same parameter settings on both sides, the model accuracy, F1, AUC, KS and other metrics after training are essentially the same, differing only after the third decimal place.
Training metrics on the MNIST dataset.
The main difference between the two appears in training speed. Averaged over 5 training iterations, Fedlearner trains about 17.2% faster than FATE.
One-click deployment, visualization platform
To handle large-scale data, Fedlearner relies on distributed computing and storage systems for data intersection and model training. To this end, ByteDance’s federated learning team developed a solution based on Kubernetes + HDFS/MySQL/Elasticsearch, in which Kubernetes manages clusters and tasks. On public clouds such as Alibaba Cloud, users can build an entire cluster quickly with a one-click script.
The entire system is packaged and deployed with K8s Helm Charts. Once a standard K8s cluster and HDFS/MySQL services are ready, users can deploy the whole system with Helm in one step.
After the system is deployed, the K8s cluster schedules federated learning tasks automatically. Each Fedlearner task needs to start a K8s task on each of the two participating parties, and the workers of the two tasks must be paired so they can communicate with each other. ByteDance’s federated learning team uses a custom K8s task controller together with Ingress-NGINX to implement this pairing and to encrypt cross-datacenter communication.
Because K8s itself only offers task management through the command line and YAML, ByteDance’s federated learning team developed a visual web platform for algorithm engineers, where users can submit tasks, view progress, and analyze results.
Fedlearner tasks must be launched by both parties at the same time. To reduce the communication overhead between the two parties, ByteDance’s federated learning team also developed a Ticket-based pre-authorization system. Once one party creates a Ticket that specifies the scope of authorization, the other party can launch multiple training and prediction tasks on its own.
2. The application of Fedlearner: 209% increase in advertising efficiency
At present, within ByteDance, Fedlearner connects data scattered across departments and businesses and helps those businesses mine the value of data in various scenarios while protecting the privacy and security of user and company data. Externally, ByteDance is also bringing Fedlearner to vertical industries such as e-commerce, finance, and education, working with leading customers in each industry to validate both the technology and the commercial benefits.
“The biggest challenge is how to secure the greatest perceived business value for customers. Fedlearner provides strong security guarantees and a complete federated learning ecosystem, which gives deployment a solid foundation,” Wu Di said. “But there is still a long way to go from the ‘technical foundation’ to the ‘final increase in commercial value’.”
At present, there are many open federated learning platforms on the market, and each platform has different data advantages. Wu Di believes that enterprises need to choose the appropriate federated learning platform according to their business goals.
For Fedlearner, beyond training speed and efficiency, the biggest advantage is the machine learning modeling technology that ByteDance has accumulated over a long time in recommendation and advertising. For example, in the e-commerce advertising scenario, e-commerce advertisers place product advertisements on ByteDance’s Ocean Engine platform. Both Ocean Engine and the advertisers want to increase advertising ROI and to jointly optimize deep conversions, that is, purchase events. However, e-commerce advertisers usually do not want to send sensitive information such as product details and user purchase history back to the Ocean Engine platform. This is where Fedlearner’s capabilities come in.
In this scenario, the implementation process of the federated learning platform can be roughly divided into two parts.
The first is the online part of the process:
·When users visit Douyin, Toutiao and other products served by Ocean Engine, the Ocean Engine platform ranks advertisements with the CTR/CVR model, finds the advertisement with the highest click-through/conversion rate, and displays it to the user;
·After the user clicks the advertisement, the user is taken to the shopping page on the e-commerce advertiser’s side. At the same time, the e-commerce advertiser and the Ocean Engine platform each record the click event, and both records are marked with the same example_id;
·On the advertiser’s shopping page, the user may or may not purchase the product (convert), and the advertiser records the user’s behavior as the label.
The second is the offline part of the process:
·The e-commerce advertiser and the Ocean Engine platform use the example_id recorded online to align the data and labels, and then read the data in the aligned order;
·The model is split into two halves. The Ocean Engine platform feeds its data into the first half, obtains the intermediate result (embedding), and sends it to the e-commerce advertiser;
·The e-commerce advertiser computes the second half of the model, uses the labels it recorded to compute the loss and gradient, and returns the gradient to the Ocean Engine platform;
·Finally, the e-commerce advertiser and the Ocean Engine platform each update their own half of the model.
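The offline steps can be sketched as a tiny split model in plain Python. This is an illustration under assumptions, not Fedlearner’s trainer: the class names, the two-dimensional linear halves, the made-up data, and the direct method calls standing in for the network exchange of embeddings and gradients are all hypothetical.

```python
import math
import random

random.seed(0)

class PlatformHalf:
    """First half: raw features -> embedding. Never sees labels."""
    def __init__(self, in_dim, emb_dim):
        self.W = [[random.uniform(-0.1, 0.1) for _ in range(in_dim)]
                  for _ in range(emb_dim)]
    def forward(self, x):
        self.x = x
        return [sum(w * v for w, v in zip(row, x)) for row in self.W]
    def backward(self, grad_emb, lr):
        # Uses only the gradient returned by the other party.
        for i, g in enumerate(grad_emb):
            for j, xj in enumerate(self.x):
                self.W[i][j] -= lr * g * xj

class AdvertiserHalf:
    """Second half: embedding -> conversion probability. Holds the labels."""
    def __init__(self, emb_dim):
        self.w = [random.uniform(-0.1, 0.1) for _ in range(emb_dim)]
    def step(self, emb, label, lr):
        logit = sum(w * e for w, e in zip(self.w, emb))
        prob = min(max(1 / (1 + math.exp(-logit)), 1e-12), 1 - 1e-12)
        loss = -(label * math.log(prob) + (1 - label) * math.log(1 - prob))
        dlogit = prob - label
        grad_emb = [dlogit * w for w in self.w]          # sent back to the platform
        self.w = [w - lr * dlogit * e for w, e in zip(self.w, emb)]
        return loss, grad_emb

# Each side keeps its own data, keyed by the shared example_id.
features = {"e1": [0.9, 0.1], "e2": [0.8, 0.3], "e3": [0.2, 0.7],
            "e4": [0.1, 0.9], "e5": [0.7, 0.2], "e6": [0.3, 0.8]}
labels = {"e1": 1, "e2": 1, "e3": 0, "e4": 0, "e5": 1, "e6": 0}
aligned_ids = sorted(set(features) & set(labels))

platform, advertiser = PlatformHalf(2, 2), AdvertiserHalf(2)
losses = []
for epoch in range(200):
    total = 0.0
    for eid in aligned_ids:
        emb = platform.forward(features[eid])                        # platform -> advertiser
        loss, grad_emb = advertiser.step(emb, labels[eid], lr=0.2)   # advertiser -> platform
        platform.backward(grad_emb, lr=0.2)
        total += loss
    losses.append(total / len(aligned_ids))
```

Only embeddings and embedding gradients cross the boundary between the two classes. Raw features stay on the platform side and labels stay on the advertiser side.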
With this solution, joint modeling on Fedlearner combines the user content-interest tags on the Ocean Engine side with the user transaction behavior and product tags on the e-commerce advertiser side. It makes full use of the complementary data of the two sides and optimizes multiple modules, such as advertisement recall and the fine-ranking CTR and CVR models.
In actual cooperation cases, it has improved delivery efficiency by more than 10% and helped increase both advertising volume and ROI.
“Besides the technical advantages, high-quality data is very important for this kind of deployment,” Wu Di said. “Training data is the cornerstone of machine learning, and federated learning is no exception. Choosing training data that is highly relevant to the business goal and highly expressive accelerates the path to the greatest business value.”
Ocean Engine, ByteDance’s marketing platform, draws on the data advantages of Toutiao and Douyin. Based on 600T+ of user group profiles, it dynamically analyzes and deeply models user behavior characteristics, and it maintains more than 2.2 million user tags. This data has extremely high commercial value for many industries and enterprises.
For example, in the online education industry, data on who pays for regular-priced courses is core revenue data, so many customers keep it strictly confidential and cannot export it to Ocean Engine servers. Yet payment for regular-priced courses is the core conversion and evaluation target of online education advertising, and customers also want to raise the payment and renewal rates for regular-priced courses.
With Fedlearner, using the approach called “federated learning with a one-sided Ocean Engine feature model”, Ocean Engine provides user IDs and features, the education customer provides user IDs and labels, and the two sides model jointly after intersecting their data. This makes it possible to predict the deep conversion rate for regular-priced courses without Ocean Engine knowing users’ deep behavior labels (that is, who bought regular-priced courses). Combined with dynamic bid adjustment in the fine-ranking stage of advertising, it can optimize the regular-priced course conversion rate of online education advertisements and improve customer acquisition ROI.
Based on this solution, ByteDance’s federated learning team is working closely with a number of leading customers in the online education industry. It has helped education customers increase advertising volume by 124%, increase the number of regular-priced course renewal users by 209%, raise the renewal rate by 33.1%, and reduce the customer acquisition cost of regular-priced course renewal users by 11.7%.
3. Federated learning urgently needs to solve the conflict between security and efficiency
In Wu Di’s view, federated learning is still in the “early adoption” stage. Whether it is ByteDance’s Fedlearner or the federated learning platforms launched by technology companies in China and abroad over the past two years, all of them still face a range of challenges.
Based on ByteDance’s experience in developing the Fedlearner federated learning platform and ByteDance’s implementation of federated learning technology, Wu Di summed up the four major challenges of federated learning technology.
The first is security.
Federated learning has inherent security advantages. Over the past two years, countries around the world have paid close attention to data security and tightened regulation, which has driven the continued and accelerating development of federated learning.
But in the context of machine learning, some security challenges remain. For example, could the gradients exchanged during training allow collaborator A to guess the label distribution of collaborator B and thus leak statistics about user behavior? Similarly, could Party A guess the feature distribution of Party B, or reuse the intermediate outputs passed on by Party B (such as activation outputs) in other models? Some of these “security” problems can be solved by traditional means such as homomorphic or semi-homomorphic encryption, while others require innovative machine learning algorithms and frameworks. For example, to further protect data on the customer side, Fedlearner uses encryption algorithms to strengthen privacy protection for labels and embeddings on top of its federated learning framework and neural network models.
The second is efficiency.
Although ByteDance’s Fedlearner has achieved some results in scenarios such as recommendation, advertising, and user growth, Wu Di said that data scale and training efficiency remain severe challenges in these scenarios. For example, can a training set of more than a billion rows be aligned and preprocessed within hours while the data stays invisible to the other party? Can multiple training passes over all samples be completed within hours over a limited and unstable network connection? Both place very high demands on the team.
Efficiency challenges not only the team but also security. “The stronger the data security requirements, the lower the data processing and training efficiency, and the narrower the choice of machine learning algorithms. A better balance between security and efficiency is needed,” said Wu Di. “The team is exploring new machine learning algorithms and frameworks, as well as technical approaches such as software-hardware integration. The whole industry is also investing heavily in this area, and I believe we will see many technological breakthroughs in the near future.”
The third is modeling capabilities and machine learning algorithms.
On federated learning platforms, including ByteDance’s Fedlearner, neither side of the federation can see the other’s raw data, and sometimes the two sides even hide their neural network structures from each other. This largely ensures data security, but from a technical point of view it also makes explainability and debugging harder.
How can feature engineering be done, and the important “good features” selected, without seeing both sides’ features? How can model quality problems be diagnosed, and the model iterated step by step to its best? These questions pose great challenges to a federated learning team’s modeling and machine learning algorithm capabilities.
The fourth is friendliness.
A key word of federated learning is “cooperation”, and the more extensive the cooperation, the better the effect. This also means that the technical threshold for participating in federated modeling must be continuously lowered, which brings many challenges, including: rapid deployment (physical server/private cloud/various public clouds, etc.), easy access, one-click training and service capabilities.
These are the challenges faced by the entire industry and issues that need to be further studied and discussed.