Fake data, real implications

Protocol Enterprise

Hello and welcome to Protocol Enterprise! Today: why synthetic data is changing how companies build AI models, cloud vendors celebrate Database Day, and the biggest Russian cyberthreats against U.S. businesses might be yet to come.

Spin up

Have we finally reached the beginning of the end of email as the primary office communication tool? According to new research from Spiceworks Ziff Davis, 51% of corporate IT workers say their employees prefer tools like Slack and Microsoft Teams over email for communicating with co-workers.

Fake it 'til you make it

Companies are building software that uses AI to monitor people’s behavior and interpret their emotions and body language in real life, virtually and even in the metaverse. But to develop that AI, they need fake data, and startups are stepping in to supply it.

Synthetic data companies are providing millions of images, videos and sometimes audio samples generated for the sole purpose of training or improving AI models. Those models could become part of our everyday lives in controversial forms of AI such as facial recognition, emotion AI and other algorithmic systems used to keep track of people’s behavior.

  • While in the past companies building computer vision-based AI often relied on publicly available datasets, now AI developers are looking to customized synthetic data to “address more and more domain-specific problems that have zero data you can actually access,” said Ofir Zuk, co-founder and CEO of synthetic data company Datagen.
  • Synthetic data companies including Datagen, Mindtech and Synthesis AI represent a corner of an increasingly compartmentalized AI industry.
  • They produce AI parts that will eventually be assembled to build software, features in applications or systems used in vehicles.
  • They serve customers such as computer vision engineers and data scientists working for big tech giants, automakers, gaming companies or mobile phone makers.

Like so much polyester, synthesized datasets are intended to mimic the real thing.

  • Synthetic data does not just replicate actual photo and video data; it enhances it by adding dimensions and details that help AI-based systems learn.
  • Sometimes the synthetic stuff fills serious data gaps where real data does not exist or is difficult to obtain.
  • It might depict dangerous highway situations used to train autonomous vehicle AI, or include facial images representing people of multiple ethnicities or ages needed to help ensure AI makes fair and accurate decisions.

Many of these companies tout synthetic data as a panacea for the lack of diverse AI training datasets that has contributed to discriminatory AI, particularly facial recognition.

  • “We help customers reduce AI bias by providing synthetic data spanning a wide range of age, gender, BMI and ethnicity,” said Yashar Behzadi, CEO of Synthesis AI.
  • For an AI model to pick up on all the different possible signs of cheating in multiple environments involving a variety of people, it would need a large corpus of imagery showing hand, eye and body movements to learn from. Such images could be too expensive to purchase, or could require privacy violations to obtain, even if enough of them existed.
  • “It becomes even more complex when you throw in facial key point data and skeleton pose data to train systems to understand which way the student’s gaze is going, which way their body is about to turn or which direction their hands are facing,” said Steve Harris, CEO of Mindtech, a company that offers a platform for designing and rendering images based on photorealistic computer graphics.

For AI to pick up on whether people are paying attention to the road — or to the boss during a meeting — it often needs to recognize facial expressions.

  • Synthesis AI’s datasets include minute distinctions among millions of images expressing as many as 150 facial “micromovements,” Behzadi said.
  • Customers use the company’s digital system to submit requests for custom data, and the system then automatically renders what they ordered.
  • “They’ll say, ‘I need a million images that span all these different dimensions,’” Behzadi said. The result might be thousands of facial images with a variety of skin tones, hair styles or features like hats or glasses.
  • According to founder Rana el Kaliouby, emotion AI company Affectiva has used synthetic data to increase the diversity of its dataset representing people across age ranges and ethnicities.

But as synthetic data companies push a diversity mission, their products may be used to build contentious forms of AI.

  • The legitimacy of emotion AI has been questioned by researchers who say neither humans nor machines can accurately detect people’s emotions based on facial expressions.
  • And in general, many also believe that algorithmic systems monitoring people’s facial expressions or how they walk or talk perpetuate unnecessary surveillance and could be used to unfairly penalize people.
  • However, in some cases, synthetic data suppliers remain a step removed from the products that will be manufactured using their data.
  • Instead of providing qualitative labels categorizing facial expressions as confused or bored, Synthesis AI only annotates facial images with technical information. An image label might include metadata stating that the left side of the mouth moved upwards 10 degrees, but would not come pre-labeled as “slightly happy,” for instance.

While Behzadi said Synthesis AI has turned down work with customers that wanted to use its data to identify people without their consent, he said the company has not turned down potential customers that want data to train emotion AI models.

  • Expect more synthetic data creation in the near future as it forms the foundation of all sorts of AI built for emerging virtual environments.
  • “There is the potential for synthetic data to be a prominent tool for metaverse companies,” said Harris.
— Kate Kaye


All’s fair in love and databases

There was a rush of open-source database news today — with a couple sides of one-upmanship — via announcements from Google Cloud, MariaDB and Cloudflare.

Google Cloud boasted that its new AlloyDB for PostgreSQL – a fully managed database service for top-tier relational database workloads now in preview – processes transactional workloads four times faster than standard PostgreSQL, the open-source relational database. It was up to 100 times faster for analytical queries, according to Google Cloud.

AlloyDB is also two times faster for transactional workloads than Amazon Aurora, AWS’ comparable relational database engine and one of its fastest-growing services ever, according to Google Cloud. And unlike AWS with Amazon Aurora, Google Cloud will not charge for input and output operations related to AlloyDB. Those charges can account for up to 60% of a bill for transactional workloads, a Google spokesperson said.

Open-source database company MariaDB threw down a $25,000 wager, challenging companies to compare other distributed SQL databases to Xpand, its distributed SQL database. If Xpand fails to demonstrate better throughput and latency than another distributed SQL database in a proof of concept, MariaDB will donate the money to a nonprofit or award it to offset a company’s infrastructure costs for running the test.

“We’re confident that MariaDB Xpand outperforms the early-stage distributed SQL offerings like CockroachDB, and we’re ready to put it to the test,” Robbie Mihalyi, senior vice president of Engineering for MariaDB, said in a statement.

Cybersecurity and internet infrastructure company Cloudflare, meanwhile, unveiled D1, its first SQL database. Built on the open-source SQLite, the instant, serverless cloud-storage database will allow developers to build database-backed applications using its serverless Cloudflare Workers development platform. Cloudflare will use its global network to automatically store customers’ databases as close as possible to their users, the company said. Private beta access is expected to start in June.

— Donna Goodison

No rest for the (cyber) weary

Hate to break it to you, but the lack of crippling cyberattacks in Ukraine doesn't mean we're off the hook in the U.S.

That's according to Jonathan Reiber, formerly a top cyber policy strategist in the Obama administration, who says it's likely that Vladimir Putin is still coming for Western nations on the cyber battlefield. (Sorry.) It's also probable that we're dealing with a lack of information, rather than a lack of attacks, in Ukraine, Reiber told Protocol.

"People like to say that Ukraine is not under attack in cyberspace" in a major way, he said. "I don't think we know that for certain. And I think it could be the opposite." As one indicator, Reiber cited a recent report from Microsoft, disclosing that there have actually been 237 cyber operations launched against Ukraine since Russia's invasion in late February.

"Why hasn't there been a massively disruptive cyberattack? It may be that there have been attacks — we just don't know [about them] yet," Reiber said.

In other words, keep your "Shields Up," as they say. Putin may not have launched massive cyberattacks against the West in connection with Ukraine so far, "but we know that he will," said Reiber, who is now vice president of Cybersecurity Strategy and Policy at security validation firm AttackIQ.

— Kyle Alspach

Around the enterprise

IBM signed what it called a “first of its kind” deal with AWS to make a broad swath of its software available on AWS, in what might be the beginning of the end for IBM Cloud.

If you don’t like Rackspace’s ownership structure, just wait a couple years. The company is once again looking for private investors after going public two years ago, going private six years ago, and going public for the first time in 2008.

Intel introduced Project Amber, a software service designed to extend trusted execution environments across multiple clouds.

ServiceNow unveiled a new tool that allows different parts of an IT service team to collaborate on issues more directly.


Thanks for reading — see you tomorrow!
