How does VR achieve location tracking? | Hardcore Open Course

Location tracking is the core and most complex part of a virtual reality device. Good location tracking is what makes good immersion possible, but it is not easy to do well: it places very high demands on both hardware and algorithms. The tracking technologies adopted across the industry currently differ, and judging from the experience offered by existing products, the mainstream solutions still leave a lot of room for improvement.

So how does VR achieve location tracking? How should the various tracking metrics be weighed against each other during design? What issues should manufacturers pay attention to? This issue of Hardcore Open Course answers these questions.

Guest introduction: Zhang Haiwei, co-founder of Qingtong Vision, graduated in 2008 from the State Key Laboratory of Pattern Recognition and Artificial Intelligence at the Institute of Automation, Chinese Academy of Sciences. His research interests include SLAM positioning, 3D reconstruction, panorama stitching, motion capture, and expression capture, and he holds three granted invention patents in the field. Shanghai Qingtong Vision Technology Co., Ltd. was established in August 2015 and positions itself in human-computer interaction technology within the field of artificial intelligence. Its core products at present are motion capture and expression capture, used mainly in film and television, animation, games, education, medical treatment, sports, the military industry, virtual reality, and augmented reality. Qingtong Vision focuses on providing useful interaction technologies that lower the cost of communication between people and computers and make computers and humans friends.

Several easily confused concepts in virtual reality

First, I want to clarify a few concepts. We use them all the time, but what one person means by A another may understand as B, so it is worth defining them to avoid unnecessary confusion. The three concepts are "position tracking," "pose estimation," and "motion capture."

Position tracking

We live in a three-dimensional world, so "position tracking" means continuously and unambiguously determining the position of an object of interest in three-dimensional space.

The key words here are "continuously" and "position". "Continuously" is easy to understand, but "position" needs a precise definition. "Position" refers to the coordinates of the object in the three-dimensional world, that is, its coordinates along the X, Y, and Z axes. Position information has three degrees of freedom, or 3DoF.

What is the "object"?

The "object" may be our head, a gun, or a chair. All of these have volume, whereas "position" is a property of a point. The "position of an object" therefore really means the XYZ coordinates of one particular point on the object in the three-dimensional world. This matters a great deal: in virtual reality especially, even a small deviation can make the experience very different.

Pose estimation

"Pose estimation" is easy to explain: it means continuously and unambiguously determining the position and rotation of a rigid object of interest in three-dimensional space. Pose information has six degrees of freedom, or 6DoF. Virtual reality needs this six-degree-of-freedom information, not just three-degree-of-freedom position.

Unlike position, rotation has no meaning for a single point in space; it applies to a three-dimensional object, not to a point or a line. Also, two different points A and B on the same rigid body have different positions, but if A and B are each connected to a point C on the rigid body, the line segments AC and BC rotate by the same angle relative to the initial state. On the subject of the initial state: to talk about position we need a coordinate origin, and to talk about rotation we need a reference orientation that counts as zero rotation. Some people treat a prop lying horizontally as zero rotation, others treat the prop standing vertically as zero rotation, and if you wish you can even treat a tilted prop as zero rotation. There is only one requirement: the zero rotation of the model must match the zero rotation of the physical object.

Motion capture

The last is "motion capture." By default, motion capture means capturing the movement of the human body. The body has many joints, and each segment, such as the forearm and the upper arm, can be approximated as a rigid body, so human motion has many degrees of freedom. Practical motion capture applications generally capture only the larger joints; capturing every bone and every muscle is very costly or even infeasible.

Mainstream tracking technology

Current mainstream technologies fall roughly into optical, inertial, electromagnetic, mechanical, UWB, and so on.

Optical tracking is the most complex of these and has several sub-categories. One classification is by the number of cameras and markers: single camera with a single marker (PS Move); single camera with multiple markers (PS VR, Oculus, HTC Vive, SLAM); multiple cameras with a single marker (Ximmerse, depth VR); and multiple cameras with multiple markers (Qingtong Vision, OptiTrack). Depending on whether the markers emit light themselves, systems can be further divided into active-marker and passive-marker. By camera exposure mode, they can be divided into rolling shutter and global shutter. Other approaches are based on sensors such as piezoelectric and acoustic ones.

In general, a single-marker optical scheme can capture only the 3DoF position of a rigid body in space, a multi-marker scheme can capture its 6DoF pose, and a multi-camera, multi-marker scheme can capture whole-body motion.

This is easy to understand. A single marker, idealized as a mathematically infinitesimal ball, has only a position in space: however it rotates, it is still the same ball. If a rigid body carries two balls, the line connecting them lets you capture five degrees of freedom; three or more balls (not all on one line) let you solve for all six.

Inertial and mechanical solutions cannot capture position, and so cannot capture pose; they can only capture body movement. Electromagnetic solutions capture pose and body movement. UWB is used to capture position. Since each solution has its own strengths and weaknesses, practical systems are often hybrids.

The technical principle of infrared tracking

The infrared (multi-camera, multi-marker) solution is widely adopted because it is so comprehensive. Its principle is not that complicated and is based mainly on triangulation.

Before capture, the system must be calibrated. Calibration covers the camera's intrinsic parameters (which depend only on the camera itself) and its extrinsic parameters (which do not depend on the camera itself but on where the camera is placed and how it is oriented).

Some manufacturers calibrate the intrinsic parameters at the factory, so only the extrinsic parameters need to be calibrated in use. This ensures accuracy and speeds up calibration but sacrifices flexibility, so most manufacturers calibrate intrinsic and extrinsic parameters together.

With the intrinsic and extrinsic parameters known, the position of a marker in space can be recovered by triangulation.

Taking a photograph maps 3D onto 2D: depth information is lost, and each point in space is projected along a ray of light onto an image point. Once the intrinsic parameters are calibrated, any image point can be back-projected to recover the direction of its ray in three-dimensional space, and the marker must lie somewhere on that ray. With two such rays, their intersection gives the marker's position in space. Because markers are generally moving, the two cameras must be synchronized when they take their pictures; if the marker is stationary, a single camera can photograph it from different positions instead. In the actual images every marker is just a small white dot, so the dots must be told apart and matched before each one is triangulated.
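
The two-ray idea can be sketched in a few lines. This is a minimal illustration, not the method of any particular product: it assumes each camera is already calibrated, so that a marker's image point has been turned into a ray (a camera center plus a direction) in a shared world frame. Because measured rays rarely intersect exactly, the sketch returns the midpoint of the shortest segment between them. The function name and the example coordinates are made up for illustration.

```python
def triangulate_midpoint(origin1, dir1, origin2, dir2):
    """Estimate a marker position from two camera rays.

    Each ray is given by a camera center (origin) and a direction, both as
    3-element sequences in the same world frame. Returns the midpoint of
    the shortest segment connecting the two rays.
    """
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    w = [a - b for a, b in zip(origin1, origin2)]
    a = dot(dir1, dir1)
    b = dot(dir1, dir2)
    c = dot(dir2, dir2)
    d = dot(dir1, w)
    e = dot(dir2, w)

    denom = a * c - b * b
    if abs(denom) < 1e-12:
        # Parallel rays carry no depth information.
        raise ValueError("rays are parallel; cannot triangulate")

    s = (b * e - c * d) / denom  # parameter along ray 1
    t = (a * e - b * d) / denom  # parameter along ray 2

    closest1 = [o + s * k for o, k in zip(origin1, dir1)]
    closest2 = [o + t * k for o, k in zip(origin2, dir2)]
    return [(p + q) / 2.0 for p, q in zip(closest1, closest2)]


if __name__ == "__main__":
    # Two cameras one meter apart, both seeing the same marker.
    print(triangulate_midpoint((0, 0, 0), (0.5, 0.2, 2.0),
                               (1, 0, 0), (-0.5, 0.2, 2.0)))
```

With cameras at (0, 0, 0) and (1, 0, 0) and rays aimed at a marker at (0.5, 0.2, 2.0), the two rays meet at that point. With noisy rays the midpoint is only one of several reasonable estimates.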

Once the marker positions are known, the 6DoF pose of a set of markers (generally called a marker body or rigid body) or the joint motion of a human body can be computed.
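
One simple way to turn three reconstructed marker positions into a 6DoF pose is to build a coordinate frame directly from them. The sketch below is an illustration under assumptions, not the algorithm used by any particular system: it takes the first marker as the body origin, points the x axis at the second marker, and uses the third marker only to fix the plane. Systems with more than three markers would typically fit all of them in a least-squares sense instead. All names are hypothetical.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    if length < 1e-12:
        raise ValueError("markers are coincident or collinear")
    return tuple(x / length for x in v)


def pose_from_three_markers(p0, p1, p2):
    """Return (position, (x_axis, y_axis, z_axis)) for a three-marker body.

    position is p0; the three axes are orthonormal vectors in world
    coordinates and together describe the body's rotation.
    """
    x_axis = _normalize(_sub(p1, p0))
    # z is perpendicular to the plane spanned by the three markers.
    z_axis = _normalize(_cross(x_axis, _sub(p2, p0)))
    # y completes the right-handed frame.
    y_axis = _cross(z_axis, x_axis)
    return tuple(p0), (x_axis, y_axis, z_axis)
```

Position supplies three degrees of freedom and the three axes supply the other three. The choice of which marker is the origin and which direction is x is exactly the "zero rotation" convention that the model and the physical prop have to share.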

This process involves camera calibration, extracting markers from the image, identifying and matching markers, recovering marker depth, and recovering the 6DoF pose of marker bodies, all while coping with factors such as occlusion and noise. Each step looks simple but is technically demanding, and a weakness in any one step leads to unsatisfactory results. In the end, all of these operations must complete within a few milliseconds, or even within 1 ms.

Effect of markers and cameras on position tracking

A general principle is that the more spread out the cameras and markers are, the more accurate the capture. Single-camera schemes like Oculus, PSVR, or HTC Vive (the Vive is classed as single-camera here because its mathematics is essentially the same, although the data processing differs somewhat) can be described with a popular but imprecise metaphor: "near things look big, far things look small." The metaphor is not strictly appropriate, but it explains part of the problem.

If you know the size of an object, you can judge how far away it is from how big it looks. The closer the object, the more accurate this judgment; when it is far away, the judgment is unreliable. It is hard, for example, to estimate the distance to a far-off mountain. In other words, when the target is far from the camera, its movement along the depth (Z) direction, forward and backward, is barely visible in the image. But if the object is also observed from another angle, the same movement is likely to appear in that second image as movement along X or Y (left-right or up-down), which is much more obvious. So the farther apart the cameras are, the more accurate the recovered depth. The line connecting two cameras is called the "baseline."
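
The effect of the baseline can be made concrete with the standard approximation for an idealized two-camera (rectified stereo) setup. That model is an assumption added here rather than something stated in the talk: depth uncertainty grows with the square of the distance and shrinks in proportion to the baseline. The focal length in pixels and the image matching error used below are illustrative values only.

```python
def depth_uncertainty_m(depth_m, baseline_m, focal_px, match_error_px):
    """Approximate depth error of an idealized rectified stereo pair.

    For depth Z = f * B / d (f: focal length in pixels, B: baseline,
    d: disparity in pixels), a disparity error dd produces a depth error
    of roughly Z**2 * dd / (f * B).
    """
    if baseline_m <= 0 or focal_px <= 0:
        raise ValueError("baseline and focal length must be positive")
    return depth_m ** 2 * match_error_px / (focal_px * baseline_m)


if __name__ == "__main__":
    # Same marker distance and image noise, three different baselines.
    for baseline in (0.1, 0.5, 1.0):
        err_mm = 1000.0 * depth_uncertainty_m(2.0, baseline, 1000.0, 0.1)
        print(f"baseline {baseline:.1f} m: about {err_mm:.2f} mm depth error")
```

Under these assumptions, a tenfold wider baseline gives a tenfold smaller depth error, while doubling the distance to the marker quadruples it.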

By the same token, the Oculus and HTC schemes cannot work at long range: accuracy falls off sharply as distance grows (many people simply assume their accuracy is very high). That said, Lighthouse trades temporal resolution for spatial resolution, a clever approach that raises resolution by at least one or two orders of magnitude at the same cost, so it fares relatively better.

By the same reasoning, the more spread out the markers are, the better the capture. Measured carefully, the HTC headset is captured more accurately than its handheld controller.

Finally, more redundant information naturally gives more accuracy under normal circumstances, but the gain is not linear, and the redundant information may itself be noise, so in practice it is enough to use as much as is needed.

Requirements for software systems and algorithms

The general composition of the algorithm was covered above. Here I want to highlight "field of view."

The biggest drawback of optical capture is occlusion. The narrower a camera's field of view, the more easily it is blocked, the smaller the capture range, and the more blind spots there are. Widening the field of view enlarges the capture range and the area in which users can interact, improves resistance to occlusion, and reduces blind spots. However, at a fixed image resolution, a wider field of view means the object occupies fewer pixels and the image is more distorted; it also means a larger scene is visible and more information has to be processed. The size of the field of view is therefore a good indicator of the level of an optical capture technology.

As for system requirements, they are actually fairly modest: mainly a somewhat higher demand on CPU and memory. Because the emphasis is on real time and the captured data arrives and must be processed continuously, GPU acceleration is generally hard to use. For the same reason, the algorithms must be parallelizable, and the trade-off between optimization quality and speed must be weighed. Some systems track very well but rely on heavy filtering, which introduces delay; this may go unnoticed in ordinary use, but it can feel bad in VR.

Application scenario

Different applications may involve different terrain, obstacles, indoor or outdoor use, and other conditions. In general, the cameras are laid out to suit the site so that the entire tracking area is covered and every part of it can be seen by at least two cameras.

As far as coverage alone is concerned, the size of the area makes little difference to infrared tracking unless the area is especially small; if anything, a larger area is easier. Because every camera has a limited angle of view, its visible region is shaped like a cone, and the closer to the camera, the smaller the visible region. In a very small space, where the target is close to the camera and the camera's angle of view is narrow, the target can easily slip out of the camera's capture range and the capture fails. In a small space a person may even block a camera entirely, causing the system to fail completely.

The solution is to widen the camera's angle of view as much as possible. When a camera is completely blocked by a person's body, however, optics is regrettably powerless; if the space really is that small, tracking has to be done by other means.

The relationship between frame rate and delay

In fact, raising the frame rate reduces delay. As noted in the discussion of delay, image processing is moved to the front end and done in real time in hardware. Our algorithm can in theory reach 1000 frames per second on megapixel images. Of course, processing the data in real time at 1000 frames per second is very demanding, and the only way to get there is to keep improving the algorithm. Part of the data could instead be processed offline, but that does not suit VR applications.

Performance

A capture system's performance can be evaluated by many metrics. Personally, I think the first priority is to keep delay low enough; by now we all understand how much delay affects VR. Second is the stability of the capture. Last is the accuracy of the capture, that is, its precision.

Delay

Delay is a relatively complex concept. In virtual reality motion capture (of any kind), delay is introduced mainly in four stages: data acquisition, data processing, data transmission, and data application. Each stage needs its own remedy.

In the data acquisition stage, use a global-shutter sensor where possible (in fact, the more important reason for a global shutter is that this exposure mode reduces motion blur on fast-moving objects), shorten the exposure time, and raise the acquisition frequency, which for a camera means raising the frame rate.

In the data processing stage, one measure is to move image processing to the front end, that is, to do it in hardware, as in Qingtong Vision's MC1300 series. There, image processing lags image capture by only 2 pixel clocks, which can be understood simply as the image processing finishing at the moment the picture is taken. The delay of this step has thus been reduced to the minimum and can be ignored.

In addition, once the information from all cameras is gathered, further data processing is needed. Here the data structures, framework, and algorithm flow must be designed carefully to keep delay as low as possible. For example, when the camera frame rate is raised to 500 frames per second, all computation must finish within 2 ms.
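
The relationship between frame rate and the time available per frame is simple arithmetic, sketched below. The stage names and timings in the example are hypothetical and are not measurements from any real system.

```python
def frame_budget_ms(frame_rate_hz):
    """Time available to process one frame before the next one arrives."""
    if frame_rate_hz <= 0:
        raise ValueError("frame rate must be positive")
    return 1000.0 / frame_rate_hz


def fits_budget(stage_times_ms, frame_rate_hz):
    """True if the per-frame stages, run back to back, fit in one frame period."""
    return sum(stage_times_ms) <= frame_budget_ms(frame_rate_hz)


if __name__ == "__main__":
    # Hypothetical per-frame costs: marker matching, triangulation, pose solving.
    stages_ms = [0.5, 0.8, 0.4]
    for fps in (120, 500, 1000):
        print(fps, "fps:", frame_budget_ms(fps), "ms budget,",
              "fits" if fits_budget(stages_ms, fps) else "does not fit")
```

At 500 frames per second the budget is 2 ms, matching the figure above; at 1000 frames per second the same work would have to fit in 1 ms.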

In the data transmission stage, the network communication protocol must be adapted, and a good network environment must be ensured. This one is a real pitfall. At an exhibition some time ago the wireless environment on site was so bad that it caused a large delay; switching to high-power 5G equipment eased the problem, and this still needs further improvement.

Finally, the delay in the data application stage is also very important. The motion capture system is generally not synchronized with the graphics card's rendering, so after capture data reaches the computer it may wait several milliseconds, or in the worst case 16.67 ms or 8.33 ms (at refresh rates of 60 Hz and 120 Hz respectively), before the graphics card uses it. After rendering, the frame still has to pass through the display's own pipeline (buffering there can add further delay). Worse, these delays are outside the motion capture system's control. The solution is motion prediction: rather than naively sending raw capture data to the graphics card, we predict the motion for the moment at which the game will refresh, which greatly reduces the perceived delay.
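
The idea of predicting motion for the refresh time can be illustrated with the simplest possible predictor, constant-velocity extrapolation from the two most recent samples. This is only a sketch under that assumption; the talk does not say which prediction model is actually used, and real systems typically also use inertial data and smoothing. The function name and the timestamps are hypothetical.

```python
def predict_position(t_prev, p_prev, t_last, p_last, t_render):
    """Extrapolate a tracked position to the time a frame will be shown.

    t_prev, t_last: capture timestamps in seconds (t_last is the newer one).
    p_prev, p_last: positions at those times, as 3-element sequences.
    t_render: the time at which the rendered frame is expected to be used.
    """
    dt = t_last - t_prev
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    # Velocity estimated from the last two samples.
    velocity = [(b - a) / dt for a, b in zip(p_prev, p_last)]
    lead = t_render - t_last  # how far ahead we need to predict
    return [p + v * lead for p, v in zip(p_last, velocity)]


if __name__ == "__main__":
    # A marker moving at 1 m/s along x, sampled 8 ms apart,
    # predicted 12 ms past the last sample.
    print(predict_position(0.000, (0.000, 0.0, 0.0),
                           0.008, (0.008, 0.0, 0.0),
                           0.020))
```

The longer the lead time, the more any error in the velocity estimate is amplified, which is one reason the underlying delays still need to be kept small.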

This is why we often see camera refresh rate treated as a stand-in for delay. That is not rigorous, but it is understandable, because everyone wants a single number. My advice is not to go by the specifications alone: actually trying the system is the most reliable test. I would also add that, for VR, a higher motion capture data rate is not always better. The data travels over the network, and when many rigid bodies are being tracked, blindly raising the refresh rate puts a heavy load on the network, so 120 Hz may be a more comfortable choice.

Stability

Improving stability depends partly on the redundant information provided by multiple cameras or multiple markers and partly on how well the algorithm is optimized. Multi-sensor fusion, such as combining optical and inertial tracking, is another option.

Accuracy

For optical capture, there are two main guarantees of accuracy:

One is the accuracy of image processing. A simple calculation tells us what physical size one pixel represents for a given sensor, angle of view, and distance. If one pixel represents a square 1 cm on a side, the accuracy may be about 1 cm. Raising the camera resolution improves this: with 10 times the resolution, the accuracy becomes 1 mm. In practice, though, capture accuracy can also be improved through the precision of the image processing itself. With sub-pixel processing accurate to 1/10 of a pixel, 1 mm precision is reached without changing the sensor; accurate to 1/100 of a pixel, the precision is sub-millimeter. Accuracy therefore cannot be judged from image resolution alone.
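
The "simple calculation" can be written out as follows. The sketch assumes an ideal, distortion-free camera looking straight at a flat scene; the 90-degree angle of view, 5 m distance, and 1000-pixel image width are illustrative values chosen only because they reproduce the 1 cm per pixel case.

```python
import math


def pixel_footprint_m(distance_m, fov_deg, pixels_across):
    """Physical width covered by one pixel at a given distance."""
    scene_width_m = 2.0 * distance_m * math.tan(math.radians(fov_deg) / 2.0)
    return scene_width_m / pixels_across


def capture_resolution_m(distance_m, fov_deg, pixels_across, subpixel_fraction=1.0):
    """Smallest resolvable displacement when marker centers are located
    to a fraction of a pixel (1.0 means whole-pixel accuracy)."""
    return pixel_footprint_m(distance_m, fov_deg, pixels_across) * subpixel_fraction


if __name__ == "__main__":
    for fraction in (1.0, 0.1, 0.01):
        res_mm = 1000.0 * capture_resolution_m(5.0, 90.0, 1000, fraction)
        print(f"{fraction:>5} pixel accuracy: about {res_mm:.2f} mm")
```

Under these assumptions a whole pixel corresponds to about 1 cm, a tenth of a pixel to about 1 mm, and a hundredth of a pixel to about 0.1 mm. The same formula also shows why widening the angle of view at a fixed resolution makes each pixel cover more of the scene.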

The other major influence on accuracy is camera calibration, including distortion correction (which is really part of camera calibration). Calibration requires that the algorithm and the associated hardware be accurate enough, and it also requires an experienced operator. This is one reason optical systems are hard to use: in inexperienced hands the calibration may be poor, which calls for a more foolproof procedure. Finally, accuracy and stability can be improved through a sensible layout of cameras and markers.

Limitations of infrared optical position tracking

When people think of the limitations of optical capture, occlusion usually comes to mind first. It is indeed the biggest problem for optical solutions, but it is not that serious in practical use.

One view holds that schemes like Oculus and HTC Vive need only one camera to track, so their resistance to occlusion is relatively strong. What goes unmentioned is that such a scheme only works when that one camera can see several markers at once, a condition that is sometimes hard to satisfy. In a multi-camera setup, even if each camera sees only one or two markers, the target's pose can most likely still be recovered. Good marker design and good algorithms can therefore largely avoid occlusion problems. Unless the site is cluttered with obstacles, or every corner of the area must be captured with no blind spots at all, the problem is not too serious.

I think the biggest limitations of optical capture are price and ease of use, with ease of use the bigger of the two. Hardware prices always depend on volume; at the volumes of today's mobile phones, price would not be a problem. Of course, before reaching such volumes we cannot simply give up: we keep improving the algorithms to reduce dependence on hardware. I believe a low-cost solution for small areas (around 50 square meters) will arrive soon.

Ease of use is a big problem. Before the rise of virtual reality, optical capture was used mainly in film, television, and animation, and the number of motion capture operators in China who could use such a system skillfully was perhaps around 10. One reason is the high price, which keeps the number of users small. The other is that the systems are too complicated: the software exposes all kinds of parameters, and someone unfamiliar with them needs a long time to learn. That is acceptable in high-end film and television, but it will not work in offline venues or for ordinary consumers, who want something plug-and-play, stable, and reliable.

Another limitation of optical capture is that it cannot identify an unlimited number of targets. Lighthouse uses distributed computation, so in theory it can handle any number of rigid bodies, but that is only theory; in practice occlusion and data transmission must be considered. Optical methods can identify one or two hundred rigid bodies, but we have not yet tested identifying many hundreds or thousands. Since the algorithms run in parallel, though, the computation could also be distributed.

In addition, passive optical motion capture identifies markers by the distances between them, so some people worry that the number of distinct distance combinations has an upper bound. There is no need to worry much: a difference of more than 5 mm counts as a new distance, and the number of permutations and combinations of different distances is enormous. Finally, active markers can provide thousands of distinguishable markers through LED blinking patterns, which for practical purposes can also be considered unlimited.

Other limitations include sensitivity to ambient light, which causes problems outdoors; active markers can solve this. The range of optical capture is also limited, unlike inertial capture, whose range is in theory unlimited.

Fusion with sensors

Optical and inertial tracking are naturally and strongly complementary. Combining them improves capture performance and can effectively reduce cost, so the two are generally used together.

In fact, this is not limited to optical plus inertial. Inertial sensing is also combined with other sensors in a variety of solutions.

The strength of optical tracking is that it provides absolute position and orientation, but it is easily occluded, its data tends to jitter, and its refresh rate is low (raising the refresh rate raises the cost sharply). The strength of inertial tracking is that it measures dynamic quantities such as angular velocity and acceleration very accurately, cannot be occluded, and has a high refresh rate; however, positions derived from it accumulate error, and the instability of the magnetic field means an inertial solution has no absolute heading information.

There are two main ways of dividing the work. One is simply to let optics provide position and inertia provide orientation; this is cheap, but it inherits the defects of both. The other is true fusion of optical and inertial data, rather than simple 1+1=2 integration. Lighthouse is a very good example of data fusion: in some cases its optical information arrives at only 30 frames per second, yet the result is accurate, smooth, and stable.
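
To give a feel for what fusion means, here is a deliberately simplified one-dimensional sketch in the style of a complementary filter: inertial data is integrated at a high rate, and whenever a slower optical fix arrives the estimate is pulled toward it. This is an assumption-laden illustration, not the algorithm used by Lighthouse or any other product; real systems work in 3D, also correct velocity and sensor bias, and usually use a Kalman-type filter. All names and numbers are hypothetical.

```python
def fuse_step(position, velocity, accel, dt, optical_position=None, gain=0.5):
    """Advance the estimate by one inertial sample, optionally correcting it.

    position, velocity: current 1D estimate.
    accel: measured acceleration for this step.
    optical_position: absolute position from the optical system, or None
        when no optical frame arrived during this step.
    gain: how strongly an optical fix pulls the estimate (0 to 1).
    """
    # Inertial prediction: fast and smooth, but errors accumulate.
    velocity = velocity + accel * dt
    position = position + velocity * dt
    # Optical correction: slower, but absolute.
    if optical_position is not None:
        position = position + gain * (optical_position - position)
    return position, velocity


def track(accels, dt, optical_fixes, gain=0.5):
    """Run fuse_step over a sequence of acceleration samples.

    optical_fixes maps a step index to the optical position seen at that step.
    """
    position, velocity = 0.0, 0.0
    for i, accel in enumerate(accels):
        position, velocity = fuse_step(
            position, velocity, accel, dt, optical_fixes.get(i), gain)
    return position


if __name__ == "__main__":
    # A stationary object whose accelerometer has a small constant bias,
    # sampled at 1 kHz for one second.
    biased = [0.1] * 1000
    inertial_only = track(biased, 0.001, {})
    # Optical fixes (true position 0.0) roughly 30 times per second.
    fused = track(biased, 0.001, {i: 0.0 for i in range(0, 1000, 33)})
    print("inertial only:", inertial_only, "fused:", fused)
```

In this toy setup the inertial-only estimate drifts steadily, while the fused estimate stays close to the true position even though optical fixes arrive only about 30 times per second.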

Tracking technology trends

At present everyone is most optimistic about depth camera solutions. In terms of the ultimate form of the technology, a depth camera combined with inertial sensing can handle SLAM, motion capture, object recognition and positioning, and a series of other functions, but the technical difficulty is still considerable.

People are very lazy animals, and they enter the virtual world to cast off the shackles of the real one. In the virtual world everything is virtual, but what you feel must be real. Virtual reality technology develops by digitizing reality and presenting it anew. If you come to regard the virtual world as the real world, virtual reality can be said to have succeeded.

From this perspective, virtual reality and augmented reality are very similar on the input side; the difference lies mainly on the display side. Good interaction lets you linger in the virtual world, feel your own body, touch the virtual world, and feel its feedback. Today this can be done approximately, but it is not yet natural enough, stable enough, easy enough to use, or cheap enough. When these things mature, virtual reality will take off. The present is like the Spring and Autumn and Warring States periods: a hundred schools of thought contend and a hundred flowers bloom.
